On Mon, Aug 22, 2011 at 12:26 PM, Sergey Vladimirov <[email protected]> wrote:
> Hi.
>
> Sorry for the weekend delay.
>
> First of all, we should understand that uploading files to JIRA / Bugzilla
> does not give us the right to distribute them under the ASL license.
> I mean, if the user didn't give us permission to do so by setting the
> "patch" option, they are not under the ASL license. Moreover, almost all
> "real-life bug files" are not owned by the uploader. For example,
> Bug33519.doc, Bug46610_3.doc, Bug46817.doc, Bug47286.doc
> (Bug47287.doc), Bug47731.doc, Bug47958.doc, Bug48075.doc,
> Bug49933.doc, Bug50936.doc are not files allowed to be distributed
> under the ASL license either. And I haven't even checked images yet. My
> point is that there are a lot of such files already, and 51524 is not
> something "new". And if having such files prohibits building a new
> version, we shall not have a new version until all those files are
> deleted from the *-src archives.

I don't see anything wrong with Bug33519.doc, Bug46610_3.doc and the
others. The fact that a file isn't owned by the bug reporter does not
mean we cannot keep it in SVN and include it in our distros. We should
stay alert for two cases:

(1) The file contents explicitly state its distribution policy, e.g. if
the file footer says it is GPL-ed then we can't include it in the
project.
(2) The file originates from an external URL and that URL (or its
parent) has some restrictions. For example, we can't include files
downloaded from web sites licensed under GPL.

I removed 51524.zip just to be on the safe side. It is a potential
blocker and I'd rather not continue with the release until it is fixed.

Let's ask [email protected] if we can keep such files in SVN
but exclude them from distros. If we can, it would be an ideal solution.

Yegor

>
> Other file types shall be checked as well (for example, 12561-1.xls,
> 12561-2.xls, 12843-1.xls, etc.).
>
> In addition, some of those files are not easy to replace. For example,
> if some file is fast-saved, we can't create a file with the same
> structure and replace the provided file with it. We don't know yet if we
> understand the fast-save feature correctly. But tests against those files
> need to be executed along with other tests, to be sure we didn't break
> the fast-save parsing functionality.
>
> I did read "ASF Legal Previously Asked Questions", but didn't find
> anything prohibiting us from uploading such files to SVN _without_
> including them in the source redistribution. My point is, if we move such
> files to another directory and exclude them from the source package, we
> can still rely on them using remote SVN HTTP access. But maybe I
> missed something in the policies.
>
> Best regards,
> Sergey.
>
> 2011/8/21 Yegor Kozlov <[email protected]>:
>> Antoni,
>>
>> Thanks for the heads-up. For now I excluded the test document from Bug
>> 51524 from POI's test files.
>>
>> PDFBox suggests a good pattern to follow - I mean a 'special' test
>> suite that operates with remote files. I'm going to add something of
>> that kind to POI too.
>>
>> Regards,
>> Yegor
>>
>>
>> On Tue, Aug 16, 2011 at 7:15 PM, Antoni Mylka
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
>>> handling test documents which are public, but not explicitly licensed to ASF
>>> for "redistribution".
>>>
>>> On 2011-08-16 14:29, Jukka Zitting wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
>>>> <[email protected]> wrote:
>>>>>
>>>>> Is this because pdfbox is liberal (don't require unit tests, keep the
>>>>> barriers to patches low), or conservative (copyright on the pdfs is
>>>>> tricky, don't commit them)? Is there any "official" policy?
>>>>
>>>> Better test coverage is always a good thing and should be our goal.
>>>>
>>>> That said, many of the example PDF files we see (like the one on
>>>> PDFBOX-1010) don't come with a license that would allow them to be
>>>> redistributed as a part of an Apache project. See [1] for Apache
>>>> guidelines on how to handle external material that hasn't explicitly
>>>> been contributed for redistribution by the ASF.
>>>>
>>>> See also [2] for related earlier work in dealing with test files with
>>>> unknown or unacceptable licensing status.
>>>>
>>>>> I do much of my text-extraction regression testing on the "govdocs1"
>>>>> dataset [1,2,3,4]. There are on the order of 300 thousand PDFs in there.
>>>>> All have been downloaded from public-facing websites owned by some US
>>>>> Government organization. They are all public, yet the copyright cannot
>>>>> be transferred to ASF. Are they OK?
>>>>
>>>> This is probably a question best answered by [email protected].
>>>> My intuition says that the best way to handle such material would be
>>>> by reference. For example a test case could refer to specific
>>>> documents within the corpus by path or document id, and would only be
>>>> executed when the user has explicitly downloaded the corpus and made
>>>> it available to the PDFBox build.
>>>
>>> There doesn't seem to be much information on any "external material" which
>>> is not a library on the ASF Legal FAQ [1]. I guess I'd ask on legal-discuss.
>>>
>>> My idea is to include such tests in a separate suite which would download
>>> the docs using some URL list. The suite would NOT run by default. It could
>>> even lie outside the main source tree. URL lists can quickly get out of date
>>> and a release must compile after 10 years. This would allow for automated
>>> testing of docs from govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and
>>> any public website stable enough to hold a file for a long time, everything
>>> which by ASF policy cannot be committed to the SVN. Do you think it's a good
>>> idea?
>>>
>>> The same problem applies to POI. I used a govdocs document as an example in
>>> POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. Now
>>> Jukka says that it's unacceptable. Should the 51524 test be disabled and the
>>> said file deleted?
>>>
>>> Antoni Myłka
>>> [email protected]
>>>
>>> [1] http://www.apache.org/legal/resolved.html
>>> [2] https://issues.apache.org/jira/browse/PDFBOX-391
>>> [3] http://digitalcorpora.org/corpora/files
>>> [4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
>>> [5] http://domex.nps.edu/corp/files/govdocs1/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>
> --
> Sergey Vladimirov
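
A rough illustration of the opt-in approach discussed in this thread: tests refer to external documents by reference and run only when the user has made the corpus available locally. It is sketched in Python for brevity and is not taken from POI or PDFBox. Everything named here is hypothetical: the EXTERNAL_CORPUS_DIR variable, the example.org URL and the helper functions are assumptions for illustration only. The sketch does not download anything; it assumes the files named in the URL list have already been fetched into one directory.

```python
import os
import unittest
from pathlib import Path

# Hypothetical environment variable pointing at the locally downloaded corpus.
CORPUS_ENV = "EXTERNAL_CORPUS_DIR"


def parse_url_list(text):
    """Return the URLs in a list file, skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def local_name(url):
    """Map a URL to the file name it is assumed to be stored under locally."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def find_document(corpus_dir, url):
    """Return the local path for url, or None if the corpus or file is absent."""
    if not corpus_dir:
        return None
    path = Path(corpus_dir) / local_name(url)
    return path if path.is_file() else None


class ExternalDocumentTest(unittest.TestCase):
    # Hypothetical document reference; a real suite would read a URL list.
    URL = "http://example.org/corpus/000001.doc"

    def test_document_is_not_empty(self):
        path = find_document(os.environ.get(CORPUS_ENV), self.URL)
        if path is None:
            # Default builds have no corpus, so the test is skipped, not failed.
            self.skipTest("external corpus not available")
        self.assertGreater(path.stat().st_size, 0)
```

With the variable unset, the test is reported as skipped, so a default build never depends on files that are not part of the source distribution.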
