Hi,

On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
<[email protected]> wrote:
> Is this because pdfbox is liberal (don't require unit tests, keep the
> barriers to patches low), or conservative (copyright on the pdfs is tricky,
> don't commit them)? Is there any "official" policy?

Better test coverage is always a good thing and should be our goal.

That said, many of the example PDF files we see (like the one on
PDFBOX-1010) don't come with a license that would allow them to be
redistributed as a part of an Apache project. See [1] for Apache
guidelines on how to handle external material that hasn't explicitly
been contributed for redistribution by the ASF.

See also [2] for related earlier work in dealing with test files with
unknown or unacceptable licensing status.

> I do much of my text-extraction regression testing on the "govdocs1" dataset
> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
> been downloaded from public-facing websites owned by some US Government
> organization. They are all public, yet the copyright cannot be transferred
> to ASF. Are they OK?

This is probably a question best answered by [email protected].
My intuition says that the best way to handle such material would be
by reference. For example, a test case could refer to specific
documents within the corpus by path or document id, and would only be
executed when the user has explicitly downloaded the corpus and made
it available to the PDFBox build.
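
As a rough sketch of what I mean (plain Java, nothing PDFBox-specific):
the "corpus.dir" property name and the document path below are made up
for illustration and are not existing PDFBox build settings. The check
simply reports a skip when the corpus has not been made available.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class ExternalCorpus {
    // Hypothetical system property pointing at a locally downloaded corpus.
    static final String CORPUS_PROPERTY = "corpus.dir";

    /**
     * Resolves a corpus-relative path. Returns empty when no corpus
     * directory is configured or the referenced document is missing.
     */
    static Optional<Path> locate(String corpusDir, String relativePath) {
        if (corpusDir == null || corpusDir.isEmpty()) {
            return Optional.empty();
        }
        Path file = Paths.get(corpusDir).resolve(relativePath);
        return Files.isRegularFile(file) ? Optional.of(file) : Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder document reference; a real test would list known ids.
        String relative = args.length > 0 ? args[0] : "000/000001.pdf";
        Optional<Path> pdf = locate(System.getProperty(CORPUS_PROPERTY), relative);
        if (!pdf.isPresent()) {
            // Nothing is bundled with the build, so skip instead of failing.
            System.out.println("SKIPPED: corpus not available");
            return;
        }
        System.out.println("Would run extraction test on " + pdf.get());
    }
}
```

Run with something like -Dcorpus.dir=/path/to/corpus to enable the
check. In a JUnit-based build the same idea would normally be expressed
as a skipped test rather than a printed message.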

[1] http://www.apache.org/legal/resolved.html
[2] https://issues.apache.org/jira/browse/PDFBOX-391

BR,

Jukka Zitting
