Hi.

Sorry for the weekend delay.

First of all, we should understand that uploading a file to JIRA / Bugzilla
does not give us the right to distribute it under the ASL
license. I mean, if the user didn't give us permission to do so by setting
the "patch" option, the file is not under the ASL license. Moreover, almost all
"real-life bug files" are not owned by the uploader. For example,
Bug33519.doc, Bug46610_3.doc, Bug46817.doc, Bug47286.doc
(Bug47287.doc), Bug47731.doc, Bug47958.doc, Bug48075.doc,
Bug49933.doc, Bug50936.doc are not files we are allowed to distribute
under the ASL license either. And I haven't even checked the images yet. My
point is that there are a lot of such files already, and 51524 is nothing
"new". And if having such files prohibits building a new version, we
cannot have a new version until all those files are deleted from
the *-src archives.

Other file types should be checked as well (for example, 12561-1.xls,
12561-2.xls, 12843-1.xls, etc.).

In addition, some of those files are not easy to replace. For example,
if a file is fast-saved, we can't create a file with the same
structure and replace the provided file with it. We don't know yet whether we
understand the fast-save feature correctly. But tests against those files
need to be executed along with the other tests, to be sure we haven't broken
the fast-save parsing functionality.

I did read the "ASF Legal Previously Asked Questions", but didn't find
anything prohibiting us from uploading such files to SVN _without_
including them in the source redistribution. My point is, if we move such
files to another directory and exclude them from the source package, we
can still rely on them using remote SVN HTTP access. But maybe I
missed something in the policies.
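
To make the idea concrete, here is a minimal sketch of how a test helper
could fetch such files on demand instead of shipping them. Everything in it
is an assumption for illustration only: the class name RemoteTestFiles, the
system properties "remote.tests.enabled" and "remote.tests.baseUrl", and the
file name "example.doc" are hypothetical and not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class RemoteTestFiles {
    // Hypothetical property names; remote tests stay off unless explicitly enabled.
    static final String ENABLE_PROPERTY = "remote.tests.enabled";
    static final String BASE_URL_PROPERTY = "remote.tests.baseUrl";

    /** True only when the build was started with -Dremote.tests.enabled=true. */
    static boolean enabled() {
        return Boolean.getBoolean(ENABLE_PROPERTY);
    }

    /** Builds the remote location of a test file below the given base URL. */
    static URI resolve(String baseUrl, String fileName) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return URI.create(base).resolve(fileName);
    }

    /** Downloads the file into cacheDir unless a copy is already there. */
    static Path fetch(URI source, Path cacheDir) throws IOException {
        Files.createDirectories(cacheDir);
        String path = source.getPath();
        Path target = cacheDir.resolve(path.substring(path.lastIndexOf('/') + 1));
        if (!Files.exists(target)) {
            try (InputStream in = source.toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (!enabled()) {
            System.out.println("Remote tests skipped");
            return;
        }
        // The base URL is supplied by whoever runs the build; nothing is hard-coded.
        String baseUrl = System.getProperty(BASE_URL_PROPERTY);
        if (baseUrl == null) {
            System.out.println("No base URL configured, remote tests skipped");
            return;
        }
        Path local = fetch(resolve(baseUrl, "example.doc"), Paths.get("build", "remote-cache"));
        System.out.println("Fetched " + local);
    }
}
```

With something like this, the default build never touches the remote files,
and the source package does not need to contain them. A test that needs one
would call fetch() only when enabled() returns true.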

Best regards,
Sergey.

2011/8/21 Yegor Kozlov <[email protected]>:
> Antoni,
>
> Thanks for heads-up. For now I excluded the test document from Bug
> 51524 from POI's test files.
>
> PDFBox suggests a good pattern to follow - I mean a 'special' test
> suite that operates with remote files. I'm going to add something of
> that kind to POI too.
>
> Regards,
> Yegor
>
>
> On Tue, Aug 16, 2011 at 7:15 PM, Antoni Mylka
> <[email protected]> wrote:
>> Hi,
>>
>> I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
>> handling test documents which are public, but not explicitly licensed to ASF
>> for "redistribution".
>>
>> On 2011-08-16 14:29, Jukka Zitting wrote:
>>>
>>> Hi,
>>>
>>> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
>>> <[email protected]>  wrote:
>>>>
>>>> Is this because pdfbox is liberal (don't require unit tests, keep the
>>>> barriers to patches low), or conservative (copyright on the pdfs is
>>>> tricky,
>>>> don't commit them)? Is there any "official" policy?
>>>
>>> Better test coverage is always a good thing and should be our goal.
>>>
>>> That said, many of the example PDF files we see (like the one on
>>> PDFBOX-1010) don't come with a license that would allow them to be
>>> redistributed as a part of an Apache project. See [1] for Apache
>>> guidelines on how to handle external material that hasn't explicitly
>>> been contributed for redistribution by the ASF.
>>
>>>
>>>
>>> See also [2] for related earlier work in dealing with test files with
>>> unknown or unacceptable licensing status.
>>>
>>>> I do much of my text-extraction regression testing on the "govdocs1"
>>>> dataset
>>>> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
>>>> been downloaded from public-facing websites owned by some US Government
>>>> organization. They are all public, yet the copyright cannot be
>>>> transferred
>>>> to ASF. Are they OK?
>>>
>>> This is probably a question best answered by [email protected].
>>> My intuition says that the best way to handle such material would be
>>> by reference. For example a test case could refer to specific
>>> documents within the corpus by path or document id, and would only be
>>> executed when the user has explicitly downloaded the corpus and made
>>> it available to the PDFBox build.
>>
>> There doesn't seem to be much information on any "external material" which
>> is not a library on the ASF Legal FAQ [1]. I guess I'd ask on legal-discuss.
>>
>> My idea is to include such tests in a separate suite which would download
>> the docs using some URL list. The suite would NOT run by default. It could
>> even lie outside the main source tree. URL lists can quickly get out of date
>> and a release must compile after 10 years. This would allow for automated
>> testing of docs from govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and
>> any public website stable enough to hold a file for a long time, everything
>> which by ASF policy cannot be committed to the SVN. Do you think it's a good
>> idea?
>>
>> The same problem applies to POI. I used a govdocs document as an example in
>> POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. Now
>> Jukka says that it's unacceptable. Should the 51524 test be disabled and the
>> said file deleted?
>>
>> Antoni Myłka
>> [email protected]
>>
>> [1] http://www.apache.org/legal/resolved.html
>> [2] https://issues.apache.org/jira/browse/PDFBOX-391
>> [3] http://digitalcorpora.org/corpora/files
>> [4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
>> [5] http://domex.nps.edu/corp/files/govdocs1/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
>



-- 
Sergey Vladimirov

