Hi Jörn,

I made the changes and tried some configurations to get F1 scores:

Version from trunk:
4k corpus: 0.9566146041916966
4k corpus + abb: 0.9617340112389617

Version with the proposed changes:
4k corpus + abb: 0.9600320812725572

Version from trunk:
90k corpus: 0.9874615992715118
90k corpus + abb: 0.9875327690060235

Version with the proposed changes:
90k corpus + abb: 0.9869524324107983

I don't know if it is conclusive, but with the changes (case-insensitive
matching, removing non-word characters) the sentence detector performed
slightly worse than trunk with the abbreviation dictionary, at least for
my Portuguese corpus.
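
For reference, here is a minimal sketch of the kind of normalization being
discussed. It is not the code that was benchmarked: the class and method names
are hypothetical, it uses a plain HashSet instead of the OpenNLP Dictionary
class, and it assumes "non-word chars" means everything that is not a letter.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
class AbbreviationSketch {

    // Stores normalized keys, so lookups ignore case and punctuation.
    private final Set<String> entries = new HashSet<>();

    // Assumption: keep only letters, then upper-case the result.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
             .filter(Character::isLetter)
             .forEach(sb::appendCodePoint);
        return sb.toString().toUpperCase(Locale.ROOT);
    }

    // Preprocess the entry before storing it; skip entries with no letters.
    void put(String abbreviation) {
        String key = normalize(abbreviation);
        if (!key.isEmpty()) {
            entries.add(key);
        }
    }

    // Apply the same normalization to the token before the lookup.
    boolean contains(String token) {
        String key = normalize(token);
        return !key.isEmpty() && entries.contains(key);
    }

    public static void main(String[] args) {
        AbbreviationSketch dict = new AbbreviationSketch();
        dict.put("Sr.");
        dict.put("Dra.");

        System.out.println(dict.contains("sr"));   // true
        System.out.println(dict.contains("SR."));  // true
        System.out.println(dict.contains("Dr."));  // false
    }
}
```

Note that with this scheme "Sr.", "SR" and "sr" all collapse to the same key,
so the dictionary feature fires for more tokens than an exact match would.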

On Mon, Mar 19, 2012 at 3:52 PM, [email protected] <
[email protected]> wrote:

> I will try the following:
>
> 1) Create a new class AbbreviationDictionary that extends Dictionary and
>    - is always case insensitive
>    - override 'put' method to preprocess entries (only word characters,
> upper case etc)
> 2) Modify sentence detector model to use the new dictionary. Changes
> should be backward compatible.
> 3) Modify context generator
>
>
> What do you think?
>
> William
>
> On Mon, Mar 19, 2012 at 1:10 PM, [email protected] <
> [email protected]> wrote:
>
>> Hi Jörn,
>>
>>
>> On Mon, Mar 19, 2012 at 5:42 AM, Jörn Kottmann <[email protected]> wrote:
>>
>>
>>> Abbreviations often can be written with dots or without. Maybe we should
>>> make a small utility method which removes all non-letters and use a
>>> case-insensitive dictionary to match the token. The same method could be
>>> run over the dictionary before it is used.
>>>
>>> What do you think?
>>>
>>
>> I think it is a good idea. I will try it.
>>
>>
>>> What happens if there is a comma?
>>>
>>
>> I don't know, do you see an issue? A comma isn't an EOS character. Maybe
>> we would have problems in the Tokenizer.
>>
>>
>>> Maybe we get better results when the dictionary feature is also combined
>>> with other features, e.g. the next initial capital feature.
>>
>>
>> I will try it too. Thanks.
>>
>>
>
>
