On Mon, Aug 22, 2011 at 12:26 PM, Sergey Vladimirov <[email protected]> wrote:
> Hi.
>
> Sorry for the weekend-delay.
>
> First of all, we should understand that uploading files to JIRA / Bugzilla
> does not give us the right to distribute them under the ASL
> license. I mean, if the user didn't give us permission to do so by setting
> the "patch" option, the files are not under the ASL license. Moreover, almost
> all "real-life bug files" are not owned by the uploader. For example,
> Bug33519.doc, Bug46610_3.doc, Bug46817.doc, Bug47286.doc
> (Bug47287.doc), Bug47731.doc, Bug47958.doc, Bug48075.doc,
> Bug49933.doc, Bug50936.doc cannot be distributed under the ASL
> license either. And I haven't even checked the images yet. My
> point is that there are a lot of such files already, and 51524 is not
> something "new". And if having such files prohibits building a new version,
> we shall not have a new version until all those files are deleted from
> the *-src archives.

I don't see anything wrong with Bug33519.doc, Bug46610_3.doc and
the others. The fact that a file isn't owned by the bug reporter does not
mean we cannot keep it in SVN and include it in our distros.

We should keep alert for two cases:

 (1) the file contents explicitly state its distribution policy, e.g. if
the file footer says it is GPL-ed then we can't include it in the project
 (2) the file originates from an external URL and that URL (or its parent)
has some restrictions. For example, we can't include files downloaded
from web sites licensed under GPL.

I removed 51524.zip just to be on the safe side. It is a potential
blocker and I'd rather not continue with the release until it is fixed.

Let's ask [email protected] if we can keep such files in svn
but exclude them from distros. If we can, that would be the ideal solution.
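
For the "test by reference" idea quoted below (a test names a document
in a separately downloaded corpus and runs only when that corpus is
available), here is a minimal Java sketch. The class name, the
test.external.corpus system property and the sample path are assumptions
made up for illustration; none of this is existing POI or PDFBox code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper: finds test documents in an optional local corpus. */
final class ExternalCorpus {
    /** Hypothetical system property that points at the downloaded corpus. */
    static final String CORPUS_PROPERTY = "test.external.corpus";

    private ExternalCorpus() {}

    /**
     * Returns the referenced document, or empty if no corpus directory is
     * configured, the file is missing, or the reference escapes the corpus.
     */
    static Optional<Path> locate(String corpusDir, String relativePath) {
        if (corpusDir == null || corpusDir.isEmpty()) {
            return Optional.empty(); // corpus not configured: caller should skip
        }
        Path root = Paths.get(corpusDir).toAbsolutePath().normalize();
        Path candidate = root.resolve(relativePath).normalize();
        // Reject references such as "../x.doc" that point outside the corpus.
        if (!candidate.startsWith(root) || !Files.isRegularFile(candidate)) {
            return Optional.empty();
        }
        return Optional.of(candidate);
    }

    /** Convenience overload that reads the corpus location from the property. */
    static Optional<Path> locate(String relativePath) {
        return locate(System.getProperty(CORPUS_PROPERTY), relativePath);
    }

    public static void main(String[] args) {
        // Hypothetical document reference; a real suite would keep a list.
        String ref = args.length > 0 ? args[0] : "000/000001.doc";
        Optional<Path> doc = locate(ref);
        if (doc.isPresent()) {
            System.out.println("Would run the test against " + doc.get());
        } else {
            System.out.println("Skipping " + ref + ": corpus not available");
        }
    }
}
```

Run with -Dtest.external.corpus=/path/to/corpus to opt in; without the
property every lookup comes back empty, so nothing from the corpus has to
be shipped in the source package for the default build to pass.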

Yegor

>
> Other file types shall be checked as well. (For example, 12561-1.xls,
> 12561-2.xls, 12843-1.xls, etc.).
>
> In addition, some of those files are not easy to replace. For example,
> if a file is fast-saved, we can't create a file with the same
> structure and replace the provided file with it. We don't know yet if we
> understand the fast-save feature correctly. But tests against those files
> need to be executed along with the other tests, to be sure we didn't break
> the fast-save parsing functionality.
>
> I did read the "ASF Legal Previously Asked Questions", but didn't find
> anything prohibiting us from uploading such files to SVN _without_
> including them in the source redistribution. My point is that if we move
> such files to another directory and exclude them from the source package,
> we can still rely on them using remote SVN http access. But maybe I
> missed something in the policies.
>
> Best regards,
> Sergey.
>
> 2011/8/21 Yegor Kozlov <[email protected]>:
>> Antoni,
>>
>> Thanks for the heads-up. For now I excluded the test document from Bug
>> 51524 from POI's test files.
>>
>> PDFBox suggests a good pattern to follow - I mean a 'special' test
>> suite that operates with remote files. I'm going to add something of
>> that kind to POI too.
>>
>> Regards,
>> Yegor
>>
>>
>> On Tue, Aug 16, 2011 at 7:15 PM, Antoni Mylka
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
>>> handling test documents which are public, but not explicitly licensed to ASF
>>> for "redistribution".
>>>
>>> On 2011-08-16 14:29, Jukka Zitting wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
>>>> <[email protected]>  wrote:
>>>>>
>>>>> Is this because pdfbox is liberal (don't require unit tests, keep the
>>>>> barriers to patches low), or conservative (copyright on the pdfs is
>>>>> tricky,
>>>>> don't commit them)? Is there any "official" policy?
>>>>
>>>> Better test coverage is always a good thing and should be our goal.
>>>>
>>>> That said, many of the example PDF files we see (like the one on
>>>> PDFBOX-1010) don't come with a license that would allow them to be
>>>> redistributed as a part of an Apache project. See [1] for Apache
>>>> guidelines on how to handle external material that hasn't explicitly
>>>> been contributed for redistribution by the ASF.
>>>
>>>>
>>>>
>>>> See also [2] for related earlier work in dealing with test files with
>>>> unknown or unacceptable licensing status.
>>>>
>>>>> I do much of my text-extraction regression testing on the "govdocs1"
>>>>> dataset
>>>>> [1,2,3,4]. There are on the order of 300 thousand PDFs in there. All have
>>>>> been downloaded from public-facing websites owned by some US Government
>>>>> organization. They are all public, yet the copyright cannot be
>>>>> transferred
>>>>> to ASF. Are they OK?
>>>>
>>>> This is probably a question best answered by [email protected].
>>>> My intuition says that the best way to handle such material would be
>>>> by reference. For example a test case could refer to specific
>>>> documents within the corpus by path or document id, and would only be
>>>> executed when the user has explicitly downloaded the corpus and made
>>>> it available to the PDFBox build.
>>>
>>> There doesn't seem to be much information on any "external material" which
>>> is not a library on the ASF Legal FAQ [1]. I guess I'd ask on legal-discuss.
>>>
>>> My idea is to include such tests in a separate suite which would download
>>> the docs using some URL list. The suite would NOT run by default. It could
>>> even lie outside the main source tree. URL lists can quickly get out of date
>>> and a release must compile after 10 years. This would allow for automated
>>> testing of docs from govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and
>>> any public website stable enough to hold a file for a long time, everything
>>> which by ASF policy cannot be committed to the SVN. Do you think it's a good
>>> idea?
>>>
>>> The same problem applies to POI. I used a govdocs document as an example in
>>> POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. Now
>>> Jukka says that it's unacceptable. Should the 51524 test be disabled and the
>>> said file deleted?
>>>
>>> Antoni Myłka
>>> [email protected]
>>>
>>> [1] http://www.apache.org/legal/resolved.html
>>> [2] https://issues.apache.org/jira/browse/PDFBOX-391
>>> [3] http://digitalcorpora.org/corpora/files
>>> [4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
>>> [5] http://domex.nps.edu/corp/files/govdocs1/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>
>
>
> --
> Sergey Vladimirov