If the unit tests automatically download publicly accessible test data,
run against it, and optionally delete the data afterwards, then the
test data does not have to be redistributed. Instead of deleting,
it might even be a good idea to cache the data to a) avoid hammering
the remote source and b) still have a local copy in case the source
fails.
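
A rough sketch of how such a download-and-cache step could look in a
JUnit test (the corpus URL, cache location and class name below are
made up, not OpenNLP code):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.junit.Assume;
import org.junit.BeforeClass;
import org.junit.Test;

public class DownloadedCorpusTest {

  // Made-up URL of a publicly accessible corpus file
  private static final String CORPUS_URL =
      "http://example.org/public-corpus/train.conll";

  private static Path corpusFile;

  @BeforeClass
  public static void downloadCorpusIfNeeded() {
    // Cache the download under the user's home directory so repeated
    // runs don't hammer the remote source and a copy survives an outage
    Path cacheDir = Paths.get(System.getProperty("user.home"), ".opennlp-test-data");
    corpusFile = cacheDir.resolve("train.conll");

    if (!Files.exists(corpusFile)) {
      try (InputStream in = new URL(CORPUS_URL).openStream()) {
        Files.createDirectories(cacheDir);
        Files.copy(in, corpusFile, StandardCopyOption.REPLACE_EXISTING);
      } catch (Exception e) {
        // No cached copy and the source is unreachable: the test below
        // will be skipped rather than failed
      }
    }
  }

  @Test
  public void trainAndEvaluate() throws Exception {
    // Skip (not fail) the test when the corpus could not be obtained
    Assume.assumeTrue(Files.exists(corpusFile));
    // train on corpusFile here and compare the evaluation score
    // against the score recorded for the previous release
  }
}

Keeping the cached copy outside the build directory also means a clean
build doesn't force a re-download.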

I believe several cases have been discussed on the legal mailing list
where non-essential or test-only resources that were not part of the
release could be under licenses that would not be deemed compatible
with the Apache license. My understanding is that the release needs
to be untainted and that downstream users must be able to trust that
they incur no license restrictions beyond the ASL.

Cheers,

-- Richard

On 14.04.2015, at 23:47, Joern Kottmann <kottm...@gmail.com> wrote:

> Hi all,
> 
> this time the progress with the testing for 1.6.0 is rather slow. Most
> tests are done now and I believe we are in good shape to build RC3.
> Anyway, it would have been better to be at that stage a month ago.
> 
> To improve the situation in the future I would like to propose automating
> all tests which can be run against publicly available data. These tests
> all follow the same pattern: they train a component on a corpus and
> afterwards evaluate against it. If the results match the results of the
> previous release we hope the code doesn't contain any regressions. In
> some cases we have changes which influence the performance (e.g. bug
> fixes); in that case we adjust the expected performance score and
> carefully verify that the particular change caused it.
> 
> We sometimes have changes which shouldn't influence the performance of a
> component but still do because of mistakes. These are the changes we need
> to identify during testing.
> 
> The big issue we have with testing against public data is that we usually
> can't include the data in the OpenNLP release because of its license. And
> today we just do all the work manually by training on a corpus and
> afterwards running the built-in evaluation against the model.
> 
> I suggest we write JUnit tests which do this in case the user has
> the right corpus for the test. Those tests will be disabled by default and
> can be run by providing the -Dtest property and the location of the data
> directory.
> 
> For example:
> mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data
> 
> The tests will do all the work and fail if the expected results don't match.
> 
> Automating those tests has the great advantage that we can run them much
> more frequently during the development phase and hopefully identify bugs
> before we even start with the release process.
> Additionally we might be able to run them on our build server.
> 
> Any opinions?
> 
> Jörn
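
For what it's worth, a minimal sketch of such a disabled-by-default test,
assuming the corpus location is passed via the OPENNLP_CORPUS_DIR system
property as in the example above (class and file names are placeholders):

import java.io.File;

import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class Conll06Test {

  private File corpusDir;

  @Before
  public void requireCorpusDir() {
    // Skip (not fail) the test unless -DOPENNLP_CORPUS_DIR=... was given
    String dir = System.getProperty("OPENNLP_CORPUS_DIR");
    Assume.assumeNotNull(dir);
    corpusDir = new File(dir);
  }

  @Test
  public void trainAndEvaluate() throws Exception {
    File trainingFile = new File(corpusDir, "conll06/train.txt"); // placeholder path
    // train the component on trainingFile, run the built-in evaluator
    // and fail if the score differs from the one of the previous release
  }
}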
