I will try the following:

1) Create a new class, AbbreviationDictionary, that extends Dictionary and
   - is always case-insensitive
   - overrides the 'put' method to preprocess entries (keep only word
     characters, convert to upper case, etc.)
2) Modify the sentence detector model to use the new dictionary. The changes
   should be backward compatible.
3) Modify the context generator.
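
A rough sketch of the preprocessing in step 1, only to illustrate the idea.
It is not the actual implementation: the class name
AbbreviationDictionarySketch and the normalize helper are hypothetical, the
class does not extend the real Dictionary (so the example stays
self-contained), and treating "word characters" as letters and digits is an
assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the proposed AbbreviationDictionary.
// It deliberately does not extend OpenNLP's Dictionary.
public class AbbreviationDictionarySketch {

    private final Set<String> entries = new HashSet<>();

    // Keep only letters and digits (an assumption), then upper-case,
    // so that "e.g." and "EG" end up as the same key.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString().toUpperCase(Locale.ROOT);
    }

    // Entries are preprocessed on the way in.
    public void put(String abbreviation) {
        String key = normalize(abbreviation);
        if (!key.isEmpty()) {
            entries.add(key);
        }
    }

    // Tokens are preprocessed the same way before the lookup.
    public boolean contains(String token) {
        return entries.contains(normalize(token));
    }

    public static void main(String[] args) {
        AbbreviationDictionarySketch dict = new AbbreviationDictionarySketch();
        dict.put("e.g.");
        dict.put("Dr.");
        System.out.println(dict.contains("eg"));
        System.out.println(dict.contains("DR"));
        System.out.println(dict.contains("Mr."));
    }
}
```

Because the same normalization runs on both the dictionary entries and the
tokens, variants with or without dots and in any case match each other.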


What do you think?

William

On Mon, Mar 19, 2012 at 1:10 PM, [email protected] <
[email protected]> wrote:

> Hi Jörn,
>
>
> On Mon, Mar 19, 2012 at 5:42 AM, Jörn Kottmann <[email protected]> wrote:
>
>
>> Abbreviations can often be written with or without dots. Maybe we should
>> write a small utility method that removes all non-letters, and use a
>> case-insensitive dictionary to match the token. The same method could be
>> run over the dictionary before it is used.
>>
>> What do you think?
>>
>
> I think it is a good idea. I will try it.
>
>
>> What happens if there is a comma?
>>
>
> I don't know. Do you see an issue? A comma isn't an EOS character. Maybe we
> would have problems in the Tokenizer.
>
>
>> Maybe we get better results when the dictionary feature is also combined
>> with other features, e.g. the next initial capital feature.
>
>
> I will try it too. Thanks.
>
>
