Antoni,

Thanks for the heads-up. For now I have excluded the test document for Bug 51524 from POI's test files.
PDFBox suggests a good pattern to follow - I mean a 'special' test suite that operates on remote files. I'm going to add something of that kind to POI too.

Regards,
Yegor

On Tue, Aug 16, 2011 at 7:15 PM, Antoni Mylka <[email protected]> wrote:
> Hi,
>
> I'm cc-ing this to dev@poi. I asked on dev@pdfbox about the policy for
> handling test documents which are public, but not explicitly licensed to ASF
> for "redistribution".
>
> On 2011-08-16 14:29, Jukka Zitting wrote:
>>
>> Hi,
>>
>> On Tue, Aug 16, 2011 at 12:46 PM, Antoni Mylka
>> <[email protected]> wrote:
>>>
>>> Is this because pdfbox is liberal (don't require unit tests, keep the
>>> barriers to patches low), or conservative (copyright on the pdfs is
>>> tricky, don't commit them)? Is there any "official" policy?
>>
>> Better test coverage is always a good thing and should be our goal.
>>
>> That said, many of the example PDF files we see (like the one on
>> PDFBOX-1010) don't come with a license that would allow them to be
>> redistributed as a part of an Apache project. See [1] for Apache
>> guidelines on how to handle external material that hasn't explicitly
>> been contributed for redistribution by the ASF.
>>
>> See also [2] for related earlier work in dealing with test files with
>> unknown or unacceptable licensing status.
>>
>>> I do much of my text-extraction regression testing on the "govdocs1"
>>> dataset [1,2,3,4]. There are on the order of 300 thousand PDFs in
>>> there. All have been downloaded from public-facing websites owned by
>>> some US Government organization. They are all public, yet the
>>> copyright cannot be transferred to ASF. Are they OK?
>>
>> This is probably a question best answered by [email protected].
>> My intuition says that the best way to handle such material would be
>> by reference. For example a test case could refer to specific
>> documents within the corpus by path or document id, and would only be
>> executed when the user has explicitly downloaded the corpus and made
>> it available to the PDFBox build.
>
> There doesn't seem to be much information on any "external material" which
> is not a library on the ASF Legal FAQ [1]. I guess I'd ask on legal-discuss.
>
> My idea is to include such tests in a separate suite which would download
> the docs using some URL list. The suite would NOT run by default. It could
> even lie outside the main source tree. URL lists can quickly get out of date
> and a release must compile after 10 years. This would allow for automated
> testing of docs from govdocs1 [3,4,5], JIRA issues, old pdfbox SF issues and
> any public website stable enough to hold a file for a long time, everything
> which by ASF policy cannot be committed to the SVN. Do you think it's a good
> idea?
>
> The same problem applies to POI. I used a govdocs document as an example in
> POI issue number 51524. Sergey Vladimirov committed it to Apache SVN. Now
> Jukka says that it's unacceptable. Should the 51524 test be disabled and the
> said file deleted?
>
> Antoni Myłka
> [email protected]
>
> [1] http://www.apache.org/legal/resolved.html
> [2] https://issues.apache.org/jira/browse/PDFBOX-391
> [3] http://digitalcorpora.org/corpora/files
> [4] http://www.dfrws.org/2009/proceedings/p2-garfinkel.pdf
> [5] http://domex.nps.edu/corp/files/govdocs1/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
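
For illustration, here is a rough sketch in Java of the helper such an opt-in remote suite might use. Everything in it is hypothetical: the class name RemoteCorpus, the system property remote.tests.enabled, and the one-URL-per-line list format are assumptions made for this example, not existing POI or PDFBox code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for a test suite that works on remote documents.
final class RemoteCorpus {

    // Assumed property name; the suite stays off unless it is set to "true".
    static final String ENABLE_PROPERTY = "remote.tests.enabled";

    private RemoteCorpus() {
    }

    // True only when the user opted in, e.g. with -Dremote.tests.enabled=true.
    static boolean isEnabled() {
        return Boolean.getBoolean(ENABLE_PROPERTY);
    }

    // Parses a URL list: one URL per line, blank lines and '#' comments ignored.
    static List<URI> parseUrlList(List<String> lines) {
        List<URI> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            result.add(URI.create(trimmed));
        }
        return result;
    }

    // Derives a local file name from the last path segment of the URL.
    // Note: two URLs ending in the same segment would share a cache entry.
    static String cacheFileName(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return "index";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index" : name;
    }

    // Downloads the document into cacheDir unless a copy is already there.
    static Path fetch(URI uri, Path cacheDir) throws IOException {
        Files.createDirectories(cacheDir);
        Path target = cacheDir.resolve(cacheFileName(uri));
        if (!Files.exists(target)) {
            try (InputStream in = uri.toURL().openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

A test in the remote suite would check RemoteCorpus.isEnabled() first and skip itself when the property is not set, so the default build never touches the network and nothing unlicensed has to be committed to SVN.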
