Test documents

Antoni Mylka Tue, 16 Aug 2011 03:47:17 -0700

Hi,

I tried investigating PDFBOX-1075 and discovered that it's related to a fix applied for PDFBOX-1010, but the earlier fix did not come with a unit test, and I had to download a doc directly from the JIRA to see if my fix didn't break the earlier one.

Is this because pdfbox is liberal (don't require unit tests, keep the barriers to patches low), or conservative (copyright on the pdfs is tricky, don't commit them)? Is there any "official" policy?

I do much of my text-extraction regression testing on the "govdocs1" dataset [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have been downloaded from public-facing websites owned by some US Government organization. They are all public, yet the copyright cannot be transferred to ASF. Are they OK?


Antoni Myłka
[email protected]

Short description:
[1] http://digitalcorpora.org/corpora/files
Longer description:
[2] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
A million documents:
[3] http://domex.nps.edu/corp/files/govdocs1/
A million documents packaged into 1000 zip files:
[4] http://domex.nps.edu/corp/files/govdocs1/zipfiles/
