Hi,

I tried investigating PDFBOX-1075 and discovered that it's related to a fix applied for PDFBOX-1010. The earlier fix did not come with a unit test, so I had to download a document directly from JIRA to check that my fix didn't break the earlier one.

Is this because PDFBox is liberal (doesn't require unit tests, keeps the barriers to patches low), or conservative (copyright on the PDFs is tricky, so don't commit them)? Is there any "official" policy?

I do much of my text-extraction regression testing on the "govdocs1" dataset [1,2,3,4]. It contains on the order of 300 thousand PDFs. All were downloaded from public-facing websites owned by US Government organizations. They are all publicly available, yet the copyright cannot be transferred to the ASF. Are they OK to use as test documents?

Antoni Myłka
[email protected]

Short description:
[1] http://digitalcorpora.org/corpora/files
Longer description:
[2] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
A million documents:
[3] http://domex.nps.edu/corp/files/govdocs1/
A million documents packaged into 1000 zip files:
[4] http://domex.nps.edu/corp/files/govdocs1/zipfiles/
