I performed a few experiments with two Portuguese corpora. All tests were
run with MAXENT, 100 iterations, and a cutoff of 5.
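
For reference, the setup was roughly the following (just a sketch against
the OpenNLP 1.5-style API; the file name and variable names are
placeholders):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    // MAXENT, 100 iterations, cutoff 5
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    params.put(TrainingParameters.ITERATIONS_PARAM, "100");
    params.put(TrainingParameters.CUTOFF_PARAM, "5");

    // training data, one sentence per line ("pt-sent.train" is a placeholder)
    ObjectStream<SentenceSample> samples = new SentenceSampleStream(
        new PlainTextByLineStream(new InputStreamReader(
            new FileInputStream("pt-sent.train"), "UTF-8")));

    // abbDict is null for the plain runs and a loaded Dictionary
    // for the "+ Abb" runs
    Dictionary abbDict = null;
    SentenceModel model =
        SentenceDetectorME.train("pt", samples, true, abbDict, params);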

F1 results for a 96k-sentence corpus:
Default CG: 0.9853360692658026
Default CG + Abb: 0.9854463195403679 (+0.0001)

Custom CG: 0.9911605417797043
Custom CG + Abb: 0.9911809163438341 (+0.00002)

To create the custom context generator I added some features that I
borrowed from the Tokenizer.
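
Schematically, it extends the default context generator and appends a few
extra features (a sketch only; the class name, feature name, and the
character-class helper are made up for illustration):

    import java.util.Collections;
    import opennlp.tools.sentdetect.DefaultSDContextGenerator;

    public class PtContextGenerator extends DefaultSDContextGenerator {

      public PtContextGenerator(char[] eosCharacters) {
        super(Collections.<String>emptySet(), eosCharacters);
      }

      @Override
      public String[] getContext(CharSequence sb, int position) {
        String[] base = super.getContext(sb, position);
        // extra tokenizer-style feature: character class right
        // after the candidate end-of-sentence position
        String nc = position + 1 < sb.length()
            ? charClass(sb.charAt(position + 1)) : "eos";
        String[] context = new String[base.length + 1];
        System.arraycopy(base, 0, context, 0, base.length);
        context[base.length] = "nc=" + nc;
        return context;
      }

      private static String charClass(char c) {
        if (Character.isUpperCase(c)) return "uc";
        if (Character.isLowerCase(c)) return "lc";
        if (Character.isDigit(c)) return "num";
        if (Character.isWhitespace(c)) return "ws";
        return "other";
      }
    }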

The numbers indicate that the abbreviation dictionary barely increased F1.
But when I tried the model, I noticed that it does in fact handle
abbreviations better. I saw the same thing when running the cross validator
with the option "-misclassified true".
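
In API terms that is something like the following, reusing params, samples
and abbDict from above (a sketch; the cross validator constructor has
changed between versions, so treat the exact signature as an assumption):

    import opennlp.tools.cmdline.sentdetect.SentenceEvaluationErrorListener;
    import opennlp.tools.sentdetect.SentenceDetectorCrossValidator;

    // the error listener prints every misclassified sample to stderr,
    // which is what the CLI's "-misclassified true" switch wires up
    SentenceDetectorCrossValidator cv = new SentenceDetectorCrossValidator(
        "pt", params, abbDict, new SentenceEvaluationErrorListener());
    cv.evaluate(samples, 10);
    System.out.println("F1: " + cv.getFMeasure().getFMeasure());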

My feeling is that the trivial cases vastly outnumber the special cases
affected by the abbreviation dictionary, so those special cases are too
rare to move the F1.

I also tried a 4k-sentence corpus. F1 values:

Custom CG: 0.9566960705693666
Custom CG + Abb: 0.958779443254818 (+0.002)

William

On Wed, Feb 15, 2012 at 1:37 PM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

> Hi,
>
> I am trying to optimize my sentence detector model by adding an
> abbreviation dictionary.
>
> Can anybody give some hints on best practices for which abbreviations to
> add here? E.g., only very frequent ones? Problematic ones? Any?
>
> I just experimented with a very big abbreviation dictionary and found
> that, in German medical patient records, this rather decreases performance.
>
> Any experiences where abbreviation dictionaries improved performance?
>
>
> Best
> Katrin
>
