Hi,

On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka <[email protected]> wrote:
> Is this because pdfbox is liberal (don't require unit tests, keep the
> barriers to patches low), or conservative (copyright on the pdfs is tricky,
> don't commit them)? Is there any "official" policy?

Better test coverage is always a good thing and should be our goal.

That said, many of the example PDF files we see (like the one on
PDFBOX-1010) don't come with a license that would allow them to be
redistributed as a part of an Apache project.

See [1] for Apache guidelines on how to handle external material that
hasn't explicitly been contributed for redistribution by the ASF. See
also [2] for related earlier work in dealing with test files with
unknown or unacceptable licensing status.

> I do much of my text-extraction regression testing on the "govdocs1" dataset
> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
> been downloaded from public-facing websites owned by some US Government
> organization. They are all public, yet the copyright cannot be transferred
> to ASF. Are they OK?

This is probably a question best answered by [email protected].

My intuition says that the best way to handle such material would be
by reference. For example a test case could refer to specific
documents within the corpus by path or document id, and would only be
executed when the user has explicitly downloaded the corpus and made
it available to the PDFBox build.

[1] http://www.apache.org/legal/resolved.html
[2] https://issues.apache.org/jira/browse/PDFBOX-391

BR,

Jukka Zitting
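
A minimal sketch of the by-reference idea described above, using only the
Java standard library. The property name "govdocs1.dir", the class name
CorpusReference and the document id "000/000009.pdf" are assumptions made
up for illustration; none of them is an existing PDFBox build setting or a
known corpus entry. The check simply skips when the corpus has not been
made available locally.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class CorpusReference {
    // Hypothetical system property pointing at a locally downloaded corpus.
    static final String CORPUS_PROPERTY = "govdocs1.dir";

    // Returns the referenced document only if the corpus root is configured
    // and the file actually exists there; otherwise returns empty.
    static Optional<Path> resolve(String corpusRoot, String relativePath) {
        if (corpusRoot == null || corpusRoot.trim().isEmpty()) {
            return Optional.empty();
        }
        Path candidate = Paths.get(corpusRoot).resolve(relativePath);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical document id; the file itself is never committed.
        String documentId = "000/000009.pdf";
        Optional<Path> pdf = resolve(System.getProperty(CORPUS_PROPERTY), documentId);
        if (!pdf.isPresent()) {
            System.out.println("SKIPPED: corpus not available, set -D"
                    + CORPUS_PROPERTY + "=<dir> to enable");
            return;
        }
        // The actual regression check (for example text extraction) would go here.
        System.out.println("Would test " + pdf.get());
    }
}
```

In a JUnit 4 test the same guard could be expressed with
Assume.assumeTrue(pdf.isPresent()), so that the test is reported as skipped
rather than failed when the corpus is absent.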
