I performed a few experiments with two Portuguese corpora. All tests were run with MAXENT, 100 iterations and a cutoff of 5.
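For reference, the training setup corresponds roughly to the sketch below (OpenNLP 1.5.x API; the training file name and the abbreviation entries are placeholders, not my actual data):

import java.io.FileInputStream;
import java.nio.charset.Charset;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.StringList;
import opennlp.tools.util.TrainingParameters;

public class TrainPtSentDetect {

    public static void main(String[] args) throws Exception {
        // Training data: one sentence per line ("pt-sent.train" is a placeholder)
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("pt-sent.train"), Charset.forName("UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // MAXENT with 100 iterations and cutoff 5, as used in all runs
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        // Abbreviation dictionary; these two entries are just examples
        Dictionary abbreviations = new Dictionary(true);
        abbreviations.put(new StringList("Sr."));
        abbreviations.put(new StringList("Dra."));

        SentenceModel model = SentenceDetectorME.train(
                "pt", samples, true, abbreviations, params);
    }
}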
F1 results for a 96k-sentence corpus:

Default CG:        0.9853360692658026
Default CG + Abb:  0.9854463195403679 (+0.0001)
Custom CG:         0.9911605417797043
Custom CG + Abb:   0.9911809163438341 (+0.00002)

To create the custom context generator I added some features that I took
from the Tokenizer (a simplified sketch is below, after the quote). The
numbers indicate that the abbreviation dictionary barely increased F1,
but when trying the model I noticed that it does in fact handle
abbreviations better. I noticed the same by running the cross validator
with the option "-misclassified true" (example invocation also below).
My feeling is that there are far more trivial cases, and the special
cases affected by the abbreviation dictionary are so few that they
hardly move the F1.

I also tried with a 4k-sentence corpus. F1 values:

Custom CG:        0.9566960705693666
Custom CG + Abb:  0.958779443254818 (+0.002)

William

On Wed, Feb 15, 2012 at 1:37 PM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:
> Hi,
>
> I am trying to optimize my sentence detector model by adding an
> abbreviation dictionary.
>
> Can anybody give some hints on best practices for which abbreviations
> to add here? E.g., only very frequent ones? Problematic ones? Any?
>
> I just experimented with a very big abbreviation dictionary and found
> that, on German medical patient records, this rather decreases
> performance.
>
> Any experiences where abbreviation dictionaries improved performance?
>
>
> Best
> Katrin
>
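As promised above, a minimal sketch of how a custom context generator can
be plugged in. It assumes the SentenceDetectorFactory extension point
available in newer OpenNLP releases; the extra feature shown is a
simplified placeholder, the real generator borrows more features from the
Tokenizer:

import opennlp.tools.sentdetect.DefaultSDContextGenerator;
import opennlp.tools.sentdetect.SDContextGenerator;
import opennlp.tools.sentdetect.SentenceDetectorFactory;

class CustomSDContextGenerator extends DefaultSDContextGenerator {

    CustomSDContextGenerator(char[] eosCharacters) {
        super(eosCharacters);
    }

    @Override
    public String[] getContext(CharSequence sb, int position) {
        String[] base = super.getContext(sb, position);

        // Extra predicate: class of the character following the
        // candidate end-of-sentence position
        String extra = "nc=eos";
        if (position + 1 < sb.length()) {
            char next = sb.charAt(position + 1);
            if (Character.isUpperCase(next)) extra = "nc=cap";
            else if (Character.isDigit(next)) extra = "nc=num";
            else if (Character.isWhitespace(next)) extra = "nc=ws";
            else extra = "nc=other";
        }

        // Append the extra predicate to the default context
        String[] context = new String[base.length + 1];
        System.arraycopy(base, 0, context, 0, base.length);
        context[base.length] = extra;
        return context;
    }
}

public class CustomSDFactory extends SentenceDetectorFactory {

    public CustomSDFactory() {
        // no-arg constructor required so the factory can be re-created
        // when the model is loaded
    }

    @Override
    public SDContextGenerator getSDContextGenerator() {
        return new CustomSDContextGenerator(getEOSCharacters());
    }
}

And the cross validator run that prints the misclassified samples
(OpenNLP 1.5.x command line; the data file name is a placeholder):

$ bin/opennlp SentenceDetectorCrossValidator -lang pt -encoding UTF-8 \
      -data pt-sent.train -misclassified true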