Re: Regression Testing

John Hewson Tue, 08 Jul 2014 20:00:08 -0700

Hi Tim,

>  My initial plan for TIKA-1302 is very similar to what Tilman outlined, and 
> my understanding/concerns/thoughts were very much in line with what he 
> articulated.  The idea is that there should be a small Apache license-able 
> gold truth set like both projects now have for specific unit tests 
> (patient-based care), but that we should also occasionally take a 
> public-health view and compare the outputs of  different versions of our 
> parsers on a large set of docs to identify new exceptions or large changes in 
> extracted content/metadata.


I’m not aware of a good supply of Apache license-able PDF files; we have very 
few such tests currently. For regression tests to be useful we really have to 
run our tests on a large corpus of real files every time.

>   I'm persuaded by your points about fair use and the importance of "open 
> data."  Before proceeding on TIKA-1302, I'd like to get broader feedback on 
> the way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I 
> quote your arguments?

Not at all, please go ahead, though obviously I’m not a lawyer. My reasoning 
is basically that Google do essentially the same thing that we want to do, 
and they have plenty of lawyers who presumably know what they’re doing.

>   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you 
> see any way that we could share resources so that we're not double-storing 
> files on Apache infrastructure?  There may be easy ways to share some eval 
> code as well.

I was thinking of just storing our test files in an SVN branch. The Tika 
project should already have read access (write access would be for PDFBox 
committers only, otherwise our builds will get broken). The tests could run 
on Jenkins as part of the normal build process. For eval code I was planning 
to simply have a single parameterized JUnit test which runs in parallel; that 
way it’s easy to run from an IDE, to debug, and to integrate with Maven. The 
unit test would look for source files in ../../regression, which would be a 
directory above the SVN trunk (i.e. a separate repo). It would do a full 
rendering of each file to a PNG and compare the results. We’ll probably have 
a text extraction test too; perhaps that’s more like what Tika will need?
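
A rough sketch of the two pieces such a test would need, using only the JDK: 
collecting the corpus files that would become the test parameters, and 
comparing two rendered images pixel by pixel. The class name, the method 
names, and the exact-match comparison are assumptions made for illustration. 
The PDFBox rendering step and the JUnit wiring are left out.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of PDFBox or Tika.
class RegressionSketch {

    /**
     * Counts the pixels that differ between two images.
     * Returns -1 when the dimensions do not match.
     */
    static long countDifferingPixels(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    /**
     * Lists every .pdf file below the corpus directory, sorted by path.
     * Each entry would become one parameter of the regression test.
     */
    static List<Path> findPdfs(Path corpusDir) throws IOException {
        try (Stream<Path> paths = Files.walk(corpusDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

A test would call findPdfs on the ../../regression directory, render each 
file, and fail when countDifferingPixels returns anything other than 0 
against the stored PNG. An exact match is the strictest possible check; a 
real suite might need a tolerance for anti-aliasing differences.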

Thanks

-- John
