Hi all,

this time progress on the testing for 1.6.0 has been rather slow. Most tests are done now, and I believe we are in good shape to build RC3. Still, it would have been better to reach this stage a month ago.

To improve the situation in the future, I would like to propose that we automate all tests which can be run against publicly available data. These tests all follow the same pattern: they train a component on a corpus and afterwards evaluate against it. If the result matches the result of the previous release, we hope the code doesn't contain any regressions.

In some cases we have changes which influence the performance (e.g. bug fixes). In that case we adjust the expected performance score and carefully verify that the particular change caused the difference. We sometimes also have changes which shouldn't influence the performance of a component but still do because of a mistake. These we need to identify during testing.

The big issue we have with testing against public data is that we usually can't include the data in the OpenNLP release because of its license. Today we do all the work manually, by training on a corpus and afterwards running the built-in evaluation against the model.

I suggest we write JUnit tests which do this when the user has the right corpus for the test. Those tests will be disabled by default and can be run by providing the -Dtest property and the location of the data directory. For example:

    mvn test -Dtest=Conll06* -DOPENNLP_CORPUS_DIR=/home/admin/opennlp-data

The tests will do all the work and fail if the results don't match the expected ones.

Automating those tests has the great advantage that we can run them much more frequently during the development phase and hopefully identify bugs before we even start the release process. Additionally, we might be able to run them on our build server.

Any opinions?

Jörn
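
P.S. To make the proposal a bit more concrete, here is a minimal sketch of the gating and comparison logic such a test could use. It is plain Java so that it runs without any test framework. The class name CorpusEvalSketch, both method names, and the idea of comparing within a tolerance are assumptions for illustration only; the property name is the one from the example command above. No real training or evaluation is done here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP: shows how a corpus-based test
// could decide whether to run and how it could compare scores.
class CorpusEvalSketch {

    // Property name taken from the example mvn command in this mail.
    static final String CORPUS_DIR_PROPERTY = "OPENNLP_CORPUS_DIR";

    // Returns the corpus directory if the property is set and points to an
    // existing directory, otherwise an empty Optional (test should be skipped).
    public static Optional<Path> corpusDir() {
        String value = System.getProperty(CORPUS_DIR_PROPERTY);
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        Path dir = Paths.get(value);
        return Files.isDirectory(dir) ? Optional.of(dir) : Optional.empty();
    }

    // True if the measured score equals the expected score within the
    // given tolerance (assumption: a small tolerance may be wanted).
    public static boolean matchesExpected(double expected, double actual, double tolerance) {
        return Math.abs(expected - actual) <= tolerance;
    }

    public static void main(String[] args) {
        Optional<Path> dir = corpusDir();
        if (!dir.isPresent()) {
            System.out.println("SKIPPED: -D" + CORPUS_DIR_PROPERTY
                    + " is not set or is not a directory");
            return;
        }
        // A real test would train on the corpus here, evaluate the model,
        // and compare the score with matchesExpected(...).
        System.out.println("Would train and evaluate using corpus at " + dir.get());
    }
}
```

In an actual JUnit 4 test, the skip could probably be expressed with org.junit.Assume.assumeTrue(corpusDir().isPresent()) and the comparison with Assert.assertEquals(expected, actual, delta), so that a missing corpus shows up as a skipped test rather than a failure.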