Hi Tim,

> My initial plan for TIKA-1302 is very similar to what Tilman outlined, and
> my understanding/concerns/thoughts were very much in line with what he
> articulated. The idea is that there should be a small Apache license-able
> gold truth set like both projects now have for specific unit tests
> (patient-based care), but that we should also occasionally take a
> public-health view and compare the outputs of different versions of our
> parsers on a large set of docs to identify new exceptions or large changes in
> extracted content/metadata.

I’m not aware of a good supply of Apache license-able PDF files; we have very few such tests currently. For regression tests to be useful, we really have to run our tests on a large corpus of real files every time.

> I'm persuaded by your points about fair use and the importance of "open
> data." Before proceeding on TIKA-1302, I'd like to get broader feedback on
> the way ahead via legal-discuss or maybe jira's Legal. Do you mind if I
> quote your arguments?

Yes, certainly, though obviously I’m not a lawyer. My reasoning is basically that Google do essentially the same thing that we want to, and they have plenty of lawyers who presumably know what they’re doing.

> Also, I was on my way to requesting a vm from infra for TIKA-1302. Do you
> see any way that we could share resources so that we're not double-storing
> files on Apache infrastructure? There may be easy ways to share some eval
> code as well.

I was thinking of just storing our test files in an SVN branch. The Tika project should already have read access (obviously write access would be for PDFBox committers only, otherwise our builds will get broken). The tests could run on Jenkins as part of the normal build process.

For eval code I was planning to simply have a single parameterized JUnit test which runs in parallel. That way it’s easy to run from an IDE, to debug, and to integrate with Maven. The unit test would look for source files in ../../regression, which would be a directory above the SVN trunk (i.e. a separate repo). It would do a full rendering of each file to a PNG and compare the results. We’ll probably have a text extraction test too; perhaps that’s more like what Tika will need?

Thanks

-- John
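
For illustration only, the "find the files, render, compare" idea above could be sketched roughly as follows. This is a sketch under stated assumptions, not the actual PDFBox test code: the class `RenderingComparison` and the methods `findPdfs` and `countDifferingPixels` are hypothetical names, the rendering step itself is left out, and an exact per-pixel match is just one possible comparison rule.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for a rendering regression test; not part of PDFBox or Tika. */
final class RenderingComparison {

    private RenderingComparison() {
    }

    /**
     * Lists the PDF files directly inside the given directory (for example
     * ../../regression), sorted so that test ordering is stable between runs.
     */
    static List<Path> findPdfs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /**
     * Counts the pixels whose value differs between a previously stored
     * rendering and a fresh one. Returns -1 if the image dimensions differ,
     * since a pixel-by-pixel comparison is then not meaningful.
     */
    static long countDifferingPixels(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }
}
```

In a parameterized JUnit test, each path returned by `findPdfs` would presumably become one test parameter, and the test would fail when `countDifferingPixels` returns anything other than 0 for a page.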