John,

My initial plan for TIKA-1302 is very similar to what Tilman outlined, and my understanding/concerns/thoughts were very much in line with what he articulated. The idea is that there should be a small, Apache-licensable gold truth set, like both projects now have for specific unit tests (patient-based care), but that we should also occasionally take a public-health view and compare the outputs of different versions of our parsers on a large set of docs to identify new exceptions or large changes in extracted content/metadata.
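
As a rough illustration of the "public-health" comparison described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from TIKA-1302 or from either project's code: the `ParseResult` record, the `compare_runs` function, and the 50% length-change threshold are all hypothetical. It assumes each parser version has already been run over the same corpus and that the extracted text or the exception name has been recorded per document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ParseResult:
    """Outcome of one parser version on one document (hypothetical record)."""
    text: str = ""
    exception: Optional[str] = None  # exception name, or None if the parse succeeded


def compare_runs(old: Dict[str, ParseResult],
                 new: Dict[str, ParseResult],
                 max_length_change: float = 0.5) -> Dict[str, List[str]]:
    """Flag documents whose behaviour changed between two parser versions.

    max_length_change is an assumed threshold: the relative difference in
    extracted text length above which a document is reported.
    """
    report: Dict[str, List[str]] = {
        "new_exceptions": [],
        "fixed_exceptions": [],
        "large_changes": [],
    }
    # Only documents present in both runs can be compared.
    for doc_id in sorted(old.keys() & new.keys()):
        before, after = old[doc_id], new[doc_id]
        if before.exception is None and after.exception is not None:
            report["new_exceptions"].append(doc_id)
        elif before.exception is not None and after.exception is None:
            report["fixed_exceptions"].append(doc_id)
        elif before.exception is None and after.exception is None:
            longest = max(len(before.text), len(after.text))
            delta = abs(len(before.text) - len(after.text))
            if longest and delta / longest > max_length_change:
                report["large_changes"].append(doc_id)
    return report


# Toy data standing in for two runs over the same corpus.
old_run = {
    "a.pdf": ParseResult(text="hello world " * 10),
    "b.pdf": ParseResult(text="some text"),
    "c.pdf": ParseResult(exception="IOException"),
}
new_run = {
    "a.pdf": ParseResult(text="hello"),
    "b.pdf": ParseResult(exception="NullPointerException"),
    "c.pdf": ParseResult(text="now parses"),
}
report = compare_runs(old_run, new_run)
print(report)
```

Comparing text length is only a crude proxy for "large changes in extracted content"; a real evaluation would likely also compare metadata and token-level differences.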

I'm persuaded by your points about fair use and the importance of "open data." Before proceeding on TIKA-1302, I'd like to get broader feedback on the way ahead via legal-discuss or maybe JIRA's Legal. Do you mind if I quote your arguments?

Also, I was on my way to requesting a VM from infra for TIKA-1302. Do you see any way that we could share resources so that we're not double-storing files on Apache infrastructure? There may be easy ways to share some eval code as well.

Best,

Tim

-----Original Message-----
From: John Hewson [mailto:j...@jahewson.com]
Sent: Saturday, July 05, 2014 5:01 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

On 5 Jul 2014, at 13:47, Tilman Hausherr <thaush...@t-online.de> wrote:

> On 05.07.2014 at 22:12, John Hewson wrote:
>>>>> Copyright is a problem: I'm testing mostly with JIRA attachments that
>>>>> I've downloaded over the years. While uploading such files to JIRA might
>>>>> count as fair use, I doubt that this would still be true if they are
>>>>> included in a distribution. Instead, they should be stored somewhere on
>>>>> Apache servers where only committers and build software ("Travis",
>>>>> "Jenkins", ...) can access them. The public PDFs that Maruan mentions
>>>>> don't possibly have all the problem cases that we solved before. However
>>>>> I have started working with these files and there are at least 5 recent
>>>>> issues that deal with them.
>>>> The PDFs won't be in a distribution. They will just happen to be stored in
>>>> an SVN repo but not our source code repo, in the same way that the website
>>>> is stored in the "cmssite" branch of SVN or indeed, are on JIRA. The law
>>>> doesn't distinguish between JIRA and SVN, both are publicly available via
>>>> HTTP, so using SVN will simply be a continuation of what we're already
>>>> doing with JIRA.
>>>>
>>>> The crucial factor is that we're only storing publicly available PDFs,
>>>> because we have the right to do so, just like Google's cache, and like we
>>>> currently do with JIRA.
>>> Yes but many PDFs we got aren't really "public". If this svn repo is only
>>> accessible to committers, and if the publicly available build scripts won't
>>> break because of this, then it is OK.
>> Any non-public PDFs will not be permitted in our test suite, just as they
>> shouldn't be on JIRA.
>>
>>> Note that even if something is "publicly available", it may still be
>>> copyrighted. Other risks can be that some people upload PDFs that include
>>> personal data. One really good test PDF was apparently a loan application.
>>> I remember that the user insisted that 1. it was test data, and 2. that it
>>> be removed.
>> All Apache development should be in the open, this is a key ASF principle,
>> having a committers-only test suite is basically a no-no. It's important to
>> understand that "fair use" allows us to use copyrighted works - this is
>> expressly permitted, it's the same legal principle as Google's cache. There
>> is no need to seek permission. This is what we've been doing with JIRA
>> already for years, so we are already doing this - it's fine.
>
> The problem is that this has all happened before. A few years ago, many files
> were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the PDFBox distribution, where there is a need to put files under an Apache 2.0 compatible license. What I'm advocating is keeping a separate public repository of test files which are not a part of the PDFBox source, like we currently have on JIRA.

-- John