Re: Should entries in the abbreviation dictionary include '.' ?

Jörn Kottmann Mon, 19 Mar 2012 01:43:17 -0700

On 03/16/2012 09:47 PM, [email protected] wrote:

Hi,


Should entries in the abbreviation dictionary include '.' ?

The one included for the unit tests includes it:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=co

If we include the EOS character not all features are collected properly.

The most important issue is here:

       if (inducedAbbreviations.contains(prefix)) {
         collectFeats.add("xabbrev");
       }

If we include the EOS character in the dictionary entries, this feature will
never be collected.

On the other hand we also have the following:

       if (inducedAbbreviations.contains(previous)) {
         collectFeats.add("vabbrev");
       }

This would fail if the previous token is an abbreviation and the abbreviation
dictionary does not include EOS characters.

I would change the code to pass the EOS character as argument to the
collectFeatures method. What do you think?
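
One possible reading of that proposal, as a rough sketch only: if the EOS character is passed in, the lookup can try the token both with and without it, so both features fire regardless of how the dictionary entries are written. The class EosAwareAbbrevFeatures, the helper isAbbreviation and the simplified collectFeatures signature below are hypothetical and are not the existing OpenNLP code; only the feature names "xabbrev" and "vabbrev" come from the snippets above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the OpenNLP implementation.
final class EosAwareAbbrevFeatures {

  private EosAwareAbbrevFeatures() {
  }

  // True if the token is in the dictionary as written, or after
  // adding or removing one trailing EOS character.
  static boolean isAbbreviation(Set<String> abbreviations, String token, char eos) {
    if (token.isEmpty()) {
      return false;
    }
    if (abbreviations.contains(token)) {
      return true;
    }
    String eosString = String.valueOf(eos);
    if (token.endsWith(eosString)) {
      return abbreviations.contains(token.substring(0, token.length() - 1));
    }
    return abbreviations.contains(token + eosString);
  }

  // Simplified stand-in for collectFeatures: only the two abbreviation
  // features discussed in this thread are produced.
  static List<String> collectFeatures(String prefix, String previous, char eos,
      Set<String> inducedAbbreviations) {
    List<String> collectFeats = new ArrayList<>();
    if (isAbbreviation(inducedAbbreviations, prefix, eos)) {
      collectFeats.add("xabbrev");
    }
    if (isAbbreviation(inducedAbbreviations, previous, eos)) {
      collectFeats.add("vabbrev");
    }
    return collectFeats;
  }
}
```

With this, a dictionary containing "Mr." matches the prefix "Mr", and a dictionary containing "Mr" matches the previous token "Mr.".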


Abbreviations often can be written with dots or without. Maybe we should
make a small utility method which removes all non-letters and use a
case-insensitive dictionary to match the token. The same method could be run
over the dictionary before it is used.
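
A minimal sketch of what such a utility could look like. The class name AbbreviationNormalizer and both method names are made up for illustration and do not exist in OpenNLP. Lower-casing stands in for the case-insensitive match, and the same normalize method is applied to the dictionary entries and to the token being looked up.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
final class AbbreviationNormalizer {

  private AbbreviationNormalizer() {
  }

  // Drops every character that is not a letter and lower-cases the rest,
  // so "e.g.", "E.G" and "eg" all become "eg".
  static String normalize(String token) {
    StringBuilder sb = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); ) {
      int cp = token.codePointAt(i);
      if (Character.isLetter(cp)) {
        sb.appendCodePoint(Character.toLowerCase(cp));
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  // Runs the same normalization over all dictionary entries before use.
  // Entries that contain no letters at all are skipped.
  static Set<String> normalizeAll(Iterable<String> entries) {
    Set<String> result = new HashSet<>();
    for (String entry : entries) {
      String normalized = normalize(entry);
      if (!normalized.isEmpty()) {
        result.add(normalized);
      }
    }
    return result;
  }
}
```

A lookup would then be normalizeAll(entries).contains(normalize(token)). Since every non-letter is removed, a trailing comma or dot on the token no longer affects the match, but abbreviations that contain digits lose them as well.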

What do you think?
What happens if there is a comma?

Maybe we get better results when the dictionary feature is also combined
with other features, e.g. the next initial capital feature.

Jörn
