Hi,
I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
handling test documents which are public, but not explicitly licensed to
the ASF for "redistribution".
On 2011-08-16 14:29, Jukka Zitting wrote:
Hi,
On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
<[email protected]> wrote:
Is this because pdfbox is liberal (don't require unit tests, keep the
barriers to patches low), or conservative (copyright on the pdfs is tricky,
don't commit them)? Is there any "official" policy?
Better test coverage is always a good thing and should be our goal.
That said, many of the example PDF files we see (like the one on
PDFBOX-1010) don't come with a license that would allow them to be
redistributed as a part of an Apache project. See [1] for Apache
guidelines on how to handle external material that hasn't explicitly
been contributed for redistribution by the ASF.
See also [2] for related earlier work in dealing with test files with
unknown or unacceptable licensing status.
I do much of my text-extraction regression testing on the "govdocs1" dataset
[1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
been downloaded from public-facing websites owned by some US Government
organization. They are all public, yet the copyright cannot be transferred
to ASF. Are they OK?
This is probably a question best answered by [email protected].
My intuition says that the best way to handle such material would be
by reference. For example a test case could refer to specific
documents within the corpus by path or document id, and would only be
executed when the user has explicitly downloaded the corpus and made
it available to the PDFBox build.
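
A minimal sketch of the "by reference" idea, under stated assumptions: the
property name test.corpus.dir and the document path 000/000009.pdf are made
up for illustration and are not existing PDFBox build settings. The check
only resolves a document inside a locally downloaded corpus and reports a
skip when the corpus is not there.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class CorpusReference {
    // Hypothetical property name; the user would set it to the corpus location.
    public static final String CORPUS_PROPERTY = "test.corpus.dir";

    /**
     * Resolves a corpus-relative document path. Returns empty when no corpus
     * directory is configured or the referenced document is not present.
     */
    public static Optional<Path> resolve(String corpusDir, String relativePath) {
        if (corpusDir == null || corpusDir.isEmpty()) {
            return Optional.empty();
        }
        Path doc = Paths.get(corpusDir).resolve(relativePath);
        return Files.isRegularFile(doc) ? Optional.of(doc) : Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative document path, not a claim about the corpus layout.
        Optional<Path> doc = resolve(System.getProperty(CORPUS_PROPERTY), "000/000009.pdf");
        if (!doc.isPresent()) {
            System.out.println("SKIPPED: corpus not available");
            return;
        }
        System.out.println("Would run the extraction test on " + doc.get());
    }
}
```

With this shape nothing from the corpus is committed; only the reference
(a relative path) lives in the source tree, and a build without the corpus
skips the test instead of failing.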
There doesn't seem to be much information on any "external material"
which is not a library on the ASF Legal FAQ [1]. I guess I'd ask on
legal-discuss.
My idea is to include such tests in a separate suite which would
download the docs using some URL list. The suite would NOT run by
default. It could even live outside the main source tree, because URL lists
can quickly get out of date, whereas a release must still build after 10 years. This
would allow for automated testing of docs from govdocs1 [3,4,5], JIRA
issues, old pdfbox SF issues and any public website stable enough to
hold a file for a long time, everything which by ASF policy cannot be
committed to the SVN. Do you think it's a good idea?
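
A rough sketch of such an opt-in suite, to make the idea concrete. Everything
named here is an assumption for illustration: the class name, the
external.tests.enabled property, and the list format (one URL per line, '#'
for comments). It only fetches the documents into a local cache; the actual
extraction tests would run on the cached files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class ExternalDocs {
    // Hypothetical opt-in switch; nothing is downloaded unless it is "true".
    public static final String ENABLE_PROPERTY = "external.tests.enabled";

    /** Parses a URL list: one URL per line; blank lines and '#' comments are ignored. */
    public static List<URI> parseUrlList(List<String> lines) {
        List<URI> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(URI.create(trimmed));
        }
        return urls;
    }

    /** Downloads the document into cacheDir unless a copy already exists; returns the local path. */
    public static Path fetchIfMissing(URI url, Path cacheDir) throws IOException {
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            throw new IllegalArgumentException("URL has no file name: " + url);
        }
        Path target = cacheDir.resolve(name);
        if (Files.exists(target)) {
            return target;
        }
        Files.createDirectories(cacheDir);
        // Download to a temporary file first so a failed transfer is never cached.
        Path tmp = Files.createTempFile(cacheDir, "download", ".part");
        try (InputStream in = url.toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (!Boolean.getBoolean(ENABLE_PROPERTY) || args.length < 2) {
            System.out.println("External document tests skipped (opt-in only).");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (URI url : parseUrlList(lines)) {
            System.out.println("Ready for testing: " + fetchIfMissing(url, Paths.get(args[1])));
        }
    }
}
```

It would be started with something like
java -Dexternal.tests.enabled=true ExternalDocs urls.txt cache-dir
so a default build never touches the network. Note that the cache is keyed
by file name only, so two URLs ending in the same name would collide; a real
suite would need a better key.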
The same problem applies to POI. I used a govdocs document as an example
in POI issue number 51524. Sergey Vladimirov committed it to Apache SVN.
Now Jukka says that it's unacceptable. Should the 51524 test be disabled
and the said file deleted?
Antoni Myłka
[email protected]
[1] http://www.apache.org/legal/resolved.html
[2] https://issues.apache.org/jira/browse/PDFBOX-391
[3] http://digitalcorpora.org/corpora/files
[4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
[5] http://domex.nps.edu/corp/files/govdocs1/