Re: An observation about the MAXENT tagger and CAPS

Jörn Kottmann Tue, 26 Jul 2011 01:24:25 -0700

You can find our current documentation here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html


Why do you have more events and fewer outcomes in your second run?

Version 1.5.1 now has built-in converters for conll06. To see how to use
them, run this command:
bin/opennlp POSTaggerConverter conllx

This is not yet described in our documentation,
so any help with that is welcome.

Jörn

On 7/26/11 1:26 AM, vishvAs vAsuki wrote:

Here is an observation about the MAXENT tagger which may be of interest to
others.

I recently tried to replicate the tagging results described in the wiki
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll06#Train_a_tokenizer_model),
while calling the tagging API
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Postagger)
from my Scala code. As with the command line tool, I used the parameters
numIterations = 100 and event-threshold = 5. The only difference was in how
the sample stream passed to the tagging API was created: I used my own Scala
code to create it (the result looked fine to the naked eye). However, my
code was reading the words in all CAPS. This resulted in a slight but
noticeable decline in accuracy: about 0.95, versus about 0.96 for the
command line tool. (More detailed output appended.)

Note that the sample streams for both the test and the training data were in
CAPS, so maybe the model treats “Port” and “port” differently.
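
To illustrate why casing could matter: taggers commonly use token-shape cues
such as "starts with a capital letter" as features. The sketch below is
illustrative only. `shape_features` is a hypothetical helper, not OpenNLP's
actual feature generator; it just shows that such cues stop distinguishing
“Port” from “port” once the text is upper-cased.

```python
def shape_features(token):
    """Return a few hypothetical capitalization cues for one token."""
    return {
        "init_cap": token[:1].isupper(),                  # first character is upper-case
        "all_caps": token.isupper(),                      # every cased character is upper-case
        "has_lower": any(ch.islower() for ch in token),   # at least one lower-case character
    }


# Compare the cues before and after forcing the token to upper case.
for original in ("Port", "port"):
    print(original, shape_features(original), shape_features(original.upper()))
```

In mixed case the two tokens get different cues; after upper-casing they are
the same string, so any feature derived from capitalization carries no
information.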


=== Command line case ===
Sorting and merging events... done. Reduced 206678 events to 193001.
...
         Number of Event Tokens: 193001
             Number of Outcomes: 22
           Number of Predicates: 29155
...done.
Computing model parameters...
Performing 100 iterations.
   1:  .. loglikelihood=-638850.4721742678       0.13807468622688432
..
100:  .. loglikelihood=-13827.506953520902      0.9901537657612325
Accuracy: 0.9659110277825124

=== My code ===
Sorting and merging events... done. Reduced 206678 events to 193059.
Done indexing.
Incorporating indexed data for training...
done.
         Number of Event Tokens: 193059
             Number of Outcomes: 16
           Number of Predicates: 27709
...done.
Computing model parameters...
Performing 100 iterations.
   1:  .. loglikelihood=-573033.0919349034        0.13807468622688432
..
100:  .. loglikelihood=-18019.22974368408        0.9831041523529355
Evaluating ... Accuracy: 0.9500596557013806
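
The two runs above also differ in the number of outcomes (22 vs 16), which
suggests the two sample streams did not carry the same tag set. A rough way
to check this is to count the distinct tags a data file yields. The sketch
below assumes the usual CoNLL-X layout (tab-separated fields, with CPOSTAG in
the 4th column and POSTAG in the 5th); `count_tags` and the inline sample
lines are hypothetical and made up for illustration, not taken from the
conll06 data.

```python
from collections import Counter


def count_tags(lines, column):
    """Count the values found in one 0-based column of CoNLL-X token lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:  # skips blank sentence separators and short lines
            continue
        counts[fields[column]] += 1
    return counts


# Made-up sample in CoNLL-X shape: ID, FORM, LEMMA, CPOSTAG, POSTAG, ...
sample = [
    "1\tA\ta\tart\tart-indef\t_\t2\tdet\t_\t_\n",
    "2\tport\tport\tn\tn-common\t_\t0\troot\t_\t_\n",
    "\n",
    "1\tPort\tport\tn\tn-proper\t_\t0\troot\t_\t_\n",
]

coarse = count_tags(sample, 3)  # CPOSTAG column
fine = count_tags(sample, 4)    # POSTAG column
print(len(coarse), "coarse tags;", len(fine), "fine tags")
```

Running the same count over the tags actually emitted by each sample stream
should show which stream produces 22 distinct outcomes and which produces 16.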

--
Cheers,
vishvAs
