Antoni,

Thanks for the heads-up. For now I have excluded the test document for
Bug 51524 from POI's test files.

PDFBox suggests a good pattern to follow: a 'special' test suite that
works with remote files instead of committed ones. I'm going to add
something of that kind to POI too.
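
To make the idea concrete, here is a minimal sketch of such an opt-in
suite in plain Java. It is only an illustration, not existing POI or
PDFBox code: the class name RemoteFileSuite, the DocumentCheck
interface and the "remote.tests" system property are all made up for
this example. The suite does nothing unless it is switched on
explicitly, and a URL that can no longer be downloaded is skipped
rather than counted as a failure, since URL lists go stale.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Opt-in suite that fetches test documents by URL instead of keeping them in SVN. */
class RemoteFileSuite {
    // Hypothetical switch: the suite stays off unless -Dremote.tests=true is passed.
    static final String ENABLE_PROPERTY = "remote.tests";

    /** Hypothetical callback that parses one downloaded document. */
    interface DocumentCheck {
        void check(URI source, byte[] content) throws Exception;
    }

    static boolean isEnabled() {
        return Boolean.getBoolean(ENABLE_PROPERTY);
    }

    /** Reads a URL list: one URL per line; blank lines and '#' comments are ignored. */
    static List<URI> parseUrlList(List<String> lines) {
        List<URI> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(URI.create(trimmed));
        }
        return urls;
    }

    /**
     * Downloads each document and runs the check on it.
     * Returns one message per failed check; the list is empty when the suite is disabled.
     */
    static List<String> run(List<URI> urls, DocumentCheck check) {
        List<String> failures = new ArrayList<>();
        if (!isEnabled()) {
            return failures; // never touch the network by default
        }
        for (URI url : urls) {
            byte[] content;
            try (InputStream in = url.toURL().openStream()) {
                content = in.readAllBytes();
            } catch (IOException e) {
                // A dead link is not a regression in the parser, so only report it.
                System.err.println("Skipping " + url + ": " + e);
                continue;
            }
            try {
                check.check(url, content);
            } catch (Exception e) {
                failures.add(url + ": " + e);
            }
        }
        return failures;
    }
}
```

A build would then run it only on request, e.g. with
-Dremote.tests=true, and pass in a check that feeds the bytes to
whichever extractor is under test.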

Regards,
Yegor


On Tue, Aug 16, 2011 at 7:15 PM, Antoni Mylka
<[email protected]> wrote:
> Hi,
>
> I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
> handling test documents which are public, but not explicitly licensed to ASF
> for "redistribution".
>
> On 2011-08-16 14:29, Jukka Zitting wrote:
>>
>> Hi,
>>
>> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
>> <[email protected]>  wrote:
>>>
>>> Is this because pdfbox is liberal (don't require unit tests, keep the
>>> barriers to patches low), or conservative (copyright on the pdfs is
>>> tricky, don't commit them)? Is there any "official" policy?
>>
>> Better test coverage is always a good thing and should be our goal.
>>
>> That said, many of the example PDF files we see (like the one on
>> PDFBOX-1010) don't come with a license that would allow them to be
>> redistributed as a part of an Apache project. See [1] for Apache
>> guidelines on how to handle external material that hasn't explicitly
>> been contributed for redistribution by the ASF.
>>
>> See also [2] for related earlier work in dealing with test files with
>> unknown or unacceptable licensing status.
>>
>>> I do much of my text-extraction regression testing on the "govdocs1"
>>> dataset [1,2,3,4]. There are on the order of 300 thousand PDFs in there.
>>> All have been downloaded from public-facing websites owned by some US
>>> Government organization. They are all public, yet the copyright cannot
>>> be transferred to ASF. Are they OK?
>>
>> This is probably a question best answered by [email protected].
>> My intuition says that the best way to handle such material would be
>> by reference. For example a test case could refer to specific
>> documents within the corpus by path or document id, and would only be
>> executed when the user has explicitly downloaded the corpus and made
>> it available to the PDFBox build.
>
> There doesn't seem to be much information on any "external material" which
> is not a library on the ASF Legal FAQ [1]. I guess I'd ask on legal-discuss.
>
> My idea is to include such tests in a separate suite which would download
> the docs using some URL list. The suite would NOT run by default. It could
> even lie outside the main source tree. URL lists can quickly get out of date
> and a release must compile after 10 years. This would allow for automated
> testing of docs from govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and
> any public website stable enough to hold a file for a long time, everything
> which by ASF policy cannot be committed to the SVN. Do you think it's a good
> idea?
>
> The same problem applies to POI. I used a govdocs document as an example in
> POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. Now
> Jukka says that it's unacceptable. Should the 51524 test be disabled and the
> said file deleted?
>
> Antoni Myłka
> [email protected]
>
> [1] http://www.apache.org/legal/resolved.html
> [2] https://issues.apache.org/jira/browse/PDFBOX-391
> [3] http://digitalcorpora.org/corpora/files
> [4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
> [5] http://domex.nps.edu/corp/files/govdocs1/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
