Hi William,

Thanks for sharing your experiences.

I did another test:
* Default Context Generator
* Corpus: Genia
* Variant 1: no abbreviation dictionary
* Variant 2: big abbreviation dictionary of ~1000 entries
* Variant 3: small abbreviation dictionary of only common, well-known abbreviations (15 entries)

Here's what I get:
* Variant 1 (F: 0.9910290237467019)
* Variant 2 (F: 0.9907676074914271)
* Variant 3 (F: 0.9910290237467019)

--> So for me, using an abbreviation dictionary does not help (at least not in the evaluation).

However, when my users start reporting recurring problems with abbreviations, I might start maintaining an abbreviation dictionary to handle those perhaps rare, but annoying, cases...


Cheers
Katrin

On 02/15/2012 05:46 PM, william.co...@gmail.com wrote:
I performed a few experiments with two Portuguese corpora. All tests were
run with MAXENT, 100 iterations, and a cutoff of 5.

F1 results for a 96k-sentence corpus:
Default CG: 0.9853360692658026
Default CG + Abb: 0.9854463195403679 (+0.0001)

Custom CG: 0.9911605417797043
Custom CG + Abb: 0.9911809163438341 (+0.00002)

To create the custom context generator, I added some features that I took
from the Tokenizer.

The numbers indicate that the abbreviation dictionary barely increased F1.
But when trying the model, I noticed that it does in fact handle
abbreviations better. I noticed the same when running the cross validator
with the option "-misclassified true".

My feeling is that there are far more trivial cases, and
the special cases that are affected by the abbreviation dictionary are so
few that they don't affect the F1.
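
The point above can be checked with a little arithmetic. The following is
only a sketch with made-up counts (they are not taken from either corpus),
and the helper name f1_from_counts is introduced here for illustration. It
shows that when only a handful of boundary decisions involve abbreviations,
fixing them moves F1 only in the fourth decimal place.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
# Baseline: 95,000 correct sentence boundaries, 1,000 spurious, 1,000 missed.
baseline = f1_from_counts(tp=95_000, fp=1_000, fn=1_000)

# Assumption: the dictionary removes 20 spurious splits after abbreviations,
# and changes nothing else.
with_abb = f1_from_counts(tp=95_000, fp=980, fn=1_000)

delta = with_abb - baseline
print(f"baseline F1: {baseline:.6f}")
print(f"with abb F1: {with_abb:.6f}")
print(f"delta:       {delta:+.6f}")
```

With these assumed counts, 20 corrected errors out of roughly 96,000
boundaries change F1 by about 0.0001, which is the same order of magnitude
as the differences reported above.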

I also tried with a 4k-sentence corpus. F1 values:

Custom CG: 0.9566960705693666
Custom CG + Abb: 0.958779443254818 (+0.002)
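
One way to put these deltas on a comparable scale is to express each gain
as a share of the remaining error (1 - F1). This is just a sketch of that
arithmetic using the F1 values listed in this message; the function name
relative_error_reduction is introduced here and is not part of OpenNLP.

```python
def relative_error_reduction(f1_before, f1_after):
    """Share of the remaining F1 error (1 - F1) removed by the change."""
    return (f1_after - f1_before) / (1.0 - f1_before)


# F1 values copied from the results in this message: (without Abb, with Abb).
runs = {
    "96k, default CG": (0.9853360692658026, 0.9854463195403679),
    "96k, custom CG": (0.9911605417797043, 0.9911809163438341),
    "4k, custom CG": (0.9566960705693666, 0.958779443254818),
}

reductions = {
    name: relative_error_reduction(before, after)
    for name, (before, after) in runs.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.2%} of the remaining error removed")
```

Seen this way, the dictionary removes a noticeably larger share of the
remaining error on the 4k corpus than on the 96k corpus, although three
runs are too few to draw a firm conclusion.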

William

On Wed, Feb 15, 2012 at 1:37 PM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi,

I am trying to optimize my sentence detector model by adding an
abbreviation dictionary.

Can anybody give some hints on best practices for choosing which
abbreviations to add here? E.g., only very frequent ones? Problematic ones? Any?

I just experimented with a very big abbreviation dictionary and found
that, on German medical patient records, it actually decreases performance.

Has anyone had experiences where abbreviation dictionaries improved performance?


Best
Katrin




--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Phone: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.toma...@averbis.com

Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
Registered office: Freiburg i. Br.
District Court (AG) Freiburg i. Br., HRB 701080
