Hi Jörn,

On Mon, Mar 19, 2012 at 5:42 AM, Jörn Kottmann <[email protected]> wrote:
> Abbreviations often can be written with dots or without. Maybe we should
> make a small utility method which removes all non-letters and use a
> case-insensitive dictionary to match the token. The same method could be
> run over the dictionary before it is used.
>
> What do you think?

I think it is a good idea. I will try it.

> What happens if there is a comma?

I don't know, do you see an issue? Comma isn't an EOS character. Maybe we
would have problems in Tokenizer.

> Maybe we get better results when the dictionary feature is also combined
> with other features, e.g. the next initial capital feature.

I will try it too.

Thanks.
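
P.S. To make the normalization idea concrete, here is a rough sketch of what such a utility method might look like. It is only an illustration under my own assumptions, not existing OpenNLP code: the class name AbbreviationNormalizer and the methods normalize, normalizeDictionary and isAbbreviation are hypothetical, and the dictionary is modelled as a plain Set<String> rather than any OpenNLP dictionary type.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class AbbreviationNormalizer {

    // Remove every character that is not a letter, then lower-case the
    // result so that lookups become case-insensitive.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // Run the same normalization over the dictionary entries before use.
    static Set<String> normalizeDictionary(Iterable<String> entries) {
        Set<String> normalized = new HashSet<>();
        for (String entry : entries) {
            String n = normalize(entry);
            if (!n.isEmpty()) {
                normalized.add(n);
            }
        }
        return normalized;
    }

    // A token matches if its normalized form is in the normalized dictionary.
    static boolean isAbbreviation(String token, Set<String> normalizedDict) {
        String n = normalize(token);
        return !n.isEmpty() && normalizedDict.contains(n);
    }
}
```

With a dictionary built from "e.g.", "Dr." and "U.S.A", the tokens "eg", "DR" and "u.s.a." would all match. Note that in this sketch a trailing comma is stripped as well, so "etc.," would be looked up as "etc"; whether that is the behaviour we want for the comma case is still an open question.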
