Hi,

Should entries in the abbreviation dictionary include the '.'?
The one included for the unit tests does:

http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=co

If the entries include the EOS character, not all features are collected properly. The most important issue is here:

    if (inducedAbbreviations.contains(prefix)) {
      collectFeats.add("xabbrev");
    }

If we include the EOS character in the dictionary entries, this feature will never be collected.

On the other hand, we also have the following:

    if (inducedAbbreviations.contains(previous)) {
      collectFeats.add("vabbrev");
    }

This would fail if the previous token is an abbreviation and the abbreviation dictionary does not include EOS characters.

I would change the code to pass the EOS character as an argument to the collectFeatures method. What do you think?

After changing it locally, the F1 of a cross-validation evaluation increased from 98.3% to 98.4%.

William
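
For illustration, here is a minimal sketch of the proposed change. It is not the actual OpenNLP code: the class name, the method signature and the parameter names are hypothetical. It assumes that dictionary entries are stored with the trailing EOS character (e.g. "Mr."), that the prefix is the text before the candidate EOS, and that the previous token may itself end with the EOS character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the context generator; not the real OpenNLP class.
final class AbbrevFeatureSketch {

    // Assumption: abbreviations holds entries WITH the trailing EOS, e.g. "Mr.".
    static List<String> collectFeatures(String prefix, String previous,
                                        char eosChar, Set<String> abbreviations) {
        List<String> feats = new ArrayList<>();

        // The prefix lacks the EOS character, so re-attach it before the lookup.
        if (abbreviations.contains(prefix + eosChar)) {
            feats.add("xabbrev");
        }

        // The previous token is looked up as-is; it may already end with the EOS.
        if (abbreviations.contains(previous)) {
            feats.add("vabbrev");
        }

        return feats;
    }
}
```

With a dictionary containing "Mr." and "Dr.", calling collectFeatures("Mr", "Dr.", '.', dict) would collect both features, whereas a lookup on the bare prefix "Mr" would miss the first one.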
