On 03/16/2012 09:47 PM, [email protected] wrote:
Hi,
Should entries in the abbreviation dictionary include the '.'?
The one included for the unit tests does:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=co
If we include the EOS character, not all features are collected properly.
The most important issue is here:
    if (inducedAbbreviations.contains(prefix)) {
        collectFeats.add("xabbrev");
    }
If we include the EOS character in the dictionary entries, this feature
will never be collected.
On the other hand we also have the following:
    if (inducedAbbreviations.contains(previous)) {
        collectFeats.add("vabbrev");
    }
This would fail if the previous token is an abbreviation and the
abbreviation dictionary does not include EOS characters.
I would change the code to pass the EOS character as argument to the
collectFeatures method. What do you think?
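
To make the mismatch concrete, here is a minimal sketch of a lookup that
takes the EOS character as an argument and accepts a dictionary entry with
or without it. It assumes, as described above, that `prefix` is the token
text before the EOS character and `previous` is the whole previous token.
The class and method names (EosAwareLookup, containsIgnoringEos) are
hypothetical and not part of OpenNLP; a plain Set stands in for the
abbreviation dictionary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not OpenNLP code.
class EosAwareLookup {

    // Returns true if the dictionary holds the candidate either without
    // or with a single trailing EOS character.
    static boolean containsIgnoringEos(Set<String> dict, String candidate, char eos) {
        String bare = candidate;
        if (!bare.isEmpty() && bare.charAt(bare.length() - 1) == eos) {
            // Strip one trailing EOS character so both spellings compare equal.
            bare = bare.substring(0, bare.length() - 1);
        }
        return dict.contains(bare) || dict.contains(bare + eos);
    }

    public static void main(String[] args) {
        // One entry stored with the EOS character, one without.
        Set<String> dict = new HashSet<>(Arrays.asList("Mr.", "etc"));

        // A plain contains() misses both cases discussed above.
        System.out.println(dict.contains("Mr"));    // prefix without EOS
        System.out.println(dict.contains("etc."));  // previous token with EOS

        // The EOS-aware lookup matches in both directions.
        System.out.println(containsIgnoringEos(dict, "Mr", '.'));
        System.out.println(containsIgnoringEos(dict, "etc.", '.'));
    }
}
```

With a lookup like this, both the "xabbrev" and the "vabbrev" checks could
use the same dictionary regardless of how its entries are written.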
Abbreviations can often be written with or without dots. Maybe we should
write a small utility method which removes all non-letters, and use a
case-insensitive dictionary to match the token. The same method could be
run over the dictionary before it is used.
What do you think?
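
A minimal sketch of such a utility, assuming "non-letters" means every
character for which Character.isLetter is false, and that lower-casing is
an acceptable way to get case-insensitive matching. The names
(AbbreviationNormalizer, normalize, normalizeAll, isAbbreviation) are made
up for illustration, and a plain Set stands in for the dictionary:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not OpenNLP code.
class AbbreviationNormalizer {

    // Drops every non-letter character and lower-cases what is left.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
             .filter(Character::isLetter)
             .forEach(sb::appendCodePoint);
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // Runs the same normalization over the dictionary entries.
    static Set<String> normalizeAll(Iterable<String> entries) {
        Set<String> result = new HashSet<>();
        for (String entry : entries) {
            String n = normalize(entry);
            if (!n.isEmpty()) {
                result.add(n);
            }
        }
        return result;
    }

    // Matches a raw token against an already normalized dictionary.
    static boolean isAbbreviation(Set<String> normalizedDict, String token) {
        String n = normalize(token);
        return !n.isEmpty() && normalizedDict.contains(n);
    }

    public static void main(String[] args) {
        Set<String> dict = normalizeAll(Arrays.asList("Mr.", "e.g.", "etc"));
        System.out.println(isAbbreviation(dict, "mr"));
        System.out.println(isAbbreviation(dict, "E.G."));
        System.out.println(isAbbreviation(dict, "etc.,"));
        System.out.println(isAbbreviation(dict, "Monday"));
    }
}
```

One trade-off of this approach: stripping dots also makes an entry such as
"No." match the ordinary word "no", so the dictionary match on its own
becomes a weaker signal.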
What happens if there is a comma?
Maybe we get better results when the dictionary feature is also combined
with other features, e.g. the next-initial-capital feature.
Jörn