You can find our current documentation here: http://incubator.apache.org/opennlp/documentation/manual/opennlp.html

Why do you have more events and fewer outcomes in your second run?

In 1.5.1 we now have built-in converters for conll06. You can see how to use them with this command:

    bin/opennlp POSTaggerConverter conllx

This is not yet described in our documentation; any help is welcome.

Jörn

On 7/26/11 1:26 AM, vishvAs vAsuki wrote:
Here is an observation about the MAXENT tagger which may be of interest to others.

I recently tried to replicate the tagging results described in the wiki (http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll06#Train_a_tokenizer_model), while calling the tagging API (http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Postagger) from my Scala code. As in the case of the command line tool, I was using the parameters numIterations = 100 and event-threshold = 5.

The only difference was in how the sample stream passed to the tagging API was created: I was using my own Scala code to create the sample stream (which looked fine to the naked eye). However, my code was reading the words in all CAPS. This resulted in a slight but noticeable decline in accuracy: about 0.96 with the command line tool vs 0.95 with my code. (More detailed output appended.)

Note that the sample streams for both the test and training data were in CAPS, so maybe the model treats “Port” and “port” differently.

=== Command line case ===
Sorting and merging events... done. Reduced 206678 events to 193001.
...
Number of Event Tokens: 193001
Number of Outcomes: 22
Number of Predicates: 29155
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-638850.4721742678 0.13807468622688432
..
100: .. loglikelihood=-13827.506953520902 0.9901537657612325
Accuracy: 0.9659110277825124

=== My code ===
Sorting and merging events... done. Reduced 206678 events to 193059.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 193059
Number of Outcomes: 16
Number of Predicates: 27709
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-573033.0919349034 0.13807468622688432
..
100: .. loglikelihood=-18019.22974368408 0.9831041523529355
Evaluating ...
Accuracy: 0.9500596557013806

--
Cheers,
vishvAs
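
One way to check whether casing alone explains the difference between the two runs is to summarise each sample stream before training. The following is only a rough sketch, written in Python for brevity rather than in Scala. The function names (`stream_stats`, `merged_by_uppercasing`), the toy words and the tags are all invented for illustration; none of it is part of the OpenNLP API or taken from the CoNLL data. It assumes a sample stream can be dumped as a list of (word, tag) pairs.

```python
from collections import defaultdict

def stream_stats(samples, uppercase=False):
    """Count tokens, distinct word forms and distinct tags in a list of
    (word, tag) pairs, optionally upper-casing the words first."""
    words = set()
    tags = set()
    for word, tag in samples:
        words.add(word.upper() if uppercase else word)
        tags.add(tag)
    return {"tokens": len(samples), "word_types": len(words), "tags": len(tags)}

def merged_by_uppercasing(samples):
    """Return the upper-cased forms that stand for more than one original
    spelling, i.e. the distinctions that are lost by reading words in CAPS."""
    forms = defaultdict(set)
    for word, _tag in samples:
        forms[word.upper()].add(word)
    return {upper: originals for upper, originals in forms.items() if len(originals) > 1}

# Toy data, made up for illustration only (hypothetical tags).
samples = [
    ("Port", "NNP"), ("port", "NN"),
    ("The", "DT"), ("the", "DT"),
    ("ships", "NNS"), ("ships", "VBZ"),
]

print(stream_stats(samples))                  # original casing
print(stream_stats(samples, uppercase=True))  # words read in CAPS
print(merged_by_uppercasing(samples))
```

In this toy case upper-casing reduces the number of distinct word forms but leaves the number of distinct tags unchanged. If the same holds for the real streams, the lower predicate count would be consistent with the CAPS input, while the drop in outcomes from 22 to 16 would more likely come from the tags being read differently. That is a guess based on the logs, not something confirmed in this thread.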