Hi,

Should entries in the abbreviation dictionary include the '.'?

The dictionary included for the unit tests does include it:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=co

If we include the EOS character, not all features are collected properly.

The most important issue is here:

      if (inducedAbbreviations.contains(prefix)) {
        collectFeats.add("xabbrev");
      }

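If we include the EOS character in the dictionary entries, this feature
will never be collected, because prefix is the text before the EOS
character and so never matches an entry that ends with it.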

On the other hand we also have the following:

      if (inducedAbbreviations.contains(previous)) {
        collectFeats.add("vabbrev");
      }

This would fail if the previous token is an abbreviation and the
abbreviation dictionary does not include EOS characters.

I would change the code to pass the EOS character as an argument to the
collectFeatures method. What do you think?
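
A minimal sketch of what that change could look like. This is not the
actual OpenNLP code: the class name AbbrevFeatureSketch, the simplified
collectFeatures signature and the returned list are assumptions made for
illustration. It also assumes that dictionary entries keep the EOS
character (e.g. "Mr."), that prefix is the text before the EOS candidate,
and that previous is a whole token with its trailing EOS character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the context generator, for illustration only.
final class AbbrevFeatureSketch {

  // Assumed to hold entries that include the EOS character, e.g. "Mr."
  private final Set<String> inducedAbbreviations;

  AbbrevFeatureSketch(Set<String> inducedAbbreviations) {
    this.inducedAbbreviations = inducedAbbreviations;
  }

  // Simplified signature (assumption): the EOS character is passed in
  // so that the prefix lookup can be done with the character appended.
  List<String> collectFeatures(String prefix, String previous, char eosChar) {
    List<String> collectFeats = new ArrayList<>();

    // prefix lacks the EOS character, so append it before the lookup.
    if (inducedAbbreviations.contains(prefix + eosChar)) {
      collectFeats.add("xabbrev");
    }

    // previous is assumed to still carry its EOS character, so it is
    // looked up unchanged.
    if (inducedAbbreviations.contains(previous)) {
      collectFeats.add("vabbrev");
    }

    return collectFeats;
  }
}
```

With a dictionary containing "Mr." and "Dr.", calling
collectFeatures("Mr", "the", '.') yields only "xabbrev", and
collectFeatures("Smith", "Dr.", '.') yields only "vabbrev", so both
features can fire against the same dictionary.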

After changing it locally, the F1 of a cross-validation evaluation
increased from 98.3% to 98.4%.

William
