Hi Jörn,

On Mon, Mar 19, 2012 at 5:42 AM, Jörn Kottmann <[email protected]> wrote:


> Abbreviations often can be written with dots or without. Maybe we should
> make a small utility method which removes all non-letters and use a
> case-insensitive
> dictionary to match the token. The same method could be run over the
> dictionary before
> it is used.
>
> What do you think?
>

I think it is a good idea. I will try it.
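
A minimal sketch of what such a utility could look like, assuming a plain
in-memory set as the dictionary. The class and method names
(AbbreviationNormalizer, normalize, normalizeAll, isAbbreviation) are
hypothetical and not part of the existing API:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not an existing class in the project.
final class AbbreviationNormalizer {

    // Drop every character that is not a letter and lower-case the rest,
    // so "U.S.A.", "USA" and "u.s.a" all map to the same key.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Run the same normalization over the dictionary entries before use.
    static Set<String> normalizeAll(Iterable<String> entries) {
        Set<String> normalized = new HashSet<>();
        for (String entry : entries) {
            String key = normalize(entry);
            if (!key.isEmpty()) {
                normalized.add(key);
            }
        }
        return normalized;
    }

    // A token matches if its normalized form is in the normalized dictionary.
    // Tokens without any letters (e.g. "...") never match.
    static boolean isAbbreviation(Set<String> normalizedDict, String token) {
        String key = normalize(token);
        return !key.isEmpty() && normalizedDict.contains(key);
    }
}
```

One thing to watch with this approach: removing the dots also makes an
abbreviation such as "no." collide with the ordinary word "no", so the
lookup alone probably should not decide the outcome.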


> What happens if there is a comma?
>

I don't know; do you see an issue? A comma isn't an end-of-sentence (EOS)
character. Maybe we would have problems in the Tokenizer.


> Maybe we get better results when the dictionary feature is also combined
> with other features, e.g the next initial capital feature.


I will try it too. Thanks.
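
A rough sketch of how the two signals could be combined into one feature,
assuming the dictionary holds lower-cased entries. The class name, the
method and the feature strings are all made up for illustration and are not
the existing context generator:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical feature generator, for illustration only.
final class CombinedEosFeatures {

    // prevToken: the token ending at the candidate EOS character.
    // nextToken: the token that follows it (may be empty at end of text).
    // abbreviations: dictionary entries, assumed to be lower-cased already.
    static List<String> features(String prevToken, String nextToken,
                                 Set<String> abbreviations) {
        boolean inDict = abbreviations.contains(prevToken.toLowerCase(Locale.ROOT));
        boolean nextCap = !nextToken.isEmpty()
                && Character.isUpperCase(nextToken.codePointAt(0));

        List<String> features = new ArrayList<>();
        if (inDict) {
            features.add("abbrev");
        }
        if (nextCap) {
            features.add("nextcap");
        }
        // The conjunction lets the model learn, for example, that a known
        // abbreviation followed by a lower-case word rarely ends a sentence.
        features.add("abbrev=" + inDict + "&nextcap=" + nextCap);
        return features;
    }
}
```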
