On 2014-10-26 22:28, Mikel Forcada wrote:
> Fran,
> 
> [snip]
> 
>> Hmm, I'm not sure if this is the case... e.g. what would happen if
>> you have, e.g.
>>
>> ^wound/wound<n><sg>/wind<vblex><past>/wound<vblex><pres>/wound<vblex><inf>$
>>
>> From your corpus (or fractional counts or something)
>>
>> wound wound<n><sg> 100
>> wound wind<vblex><past> 20
>> wound wound<vblex><pres> 50
>> wound wound<vblex><inf> 3
>>
>> And your most frequent analysis is wound<n><sg>, but your CG has
>> removed it, and left
>>
>> ^wound/wind<vblex><past>/wound<vblex><pres>/wound<vblex><inf>$
>>
>> Would it be good to know that the next most frequent analysis is
>> wound<vblex><pres> ?
> 
> Very good point there! OK, this means you would have a very looooooong
> list with all surface forms. I would only keep the most frequent
> surface forms (perhaps a couple of thousands would do nicely) and for
> less frequent forms, use the "generalized" forms.

Yeah... I was thinking that you would keep the surface forms just for
the vocabulary of your tagged corpus, and then use the generalised ones
for anything not in the tagged corpus.

In our ~30k word English tagged corpus that would mean storing
per-analysis counts for 6644 surface forms, then "backing off" to the
generalised forms for anything else.
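
A minimal sketch of that lookup in Python, just to make the back-off
concrete. The function and variable names are made up for illustration,
and I am assuming that "generalised" means the tag sequence with the
lemma stripped. The counts for "wound" are the ones from the example
above; the generalised counts are invented toy numbers.

```python
def strip_lemma(analysis):
    """Return only the tags: 'wind<vblex><past>' -> '<vblex><past>'."""
    start = analysis.find("<")
    return analysis[start:] if start != -1 else ""


def best_analysis(surface, remaining, surface_counts, general_counts):
    """Pick the most frequent analysis among those the CG left in place.

    Uses per-surface-form counts when the form was seen in the tagged
    corpus, and backs off to tag-only counts otherwise (an assumption
    about what the generalised forms look like).
    """
    per_form = surface_counts.get(surface)
    if per_form is not None:
        def score(analysis):
            return per_form.get(analysis, 0)
    else:
        def score(analysis):
            return general_counts.get(strip_lemma(analysis), 0)
    # max() returns the first candidate when scores are tied.
    return max(remaining, key=score)


# Counts for "wound" taken from the example quoted above.
surface_counts = {
    "wound": {
        "wound<n><sg>": 100,
        "wind<vblex><past>": 20,
        "wound<vblex><pres>": 50,
        "wound<vblex><inf>": 3,
    },
}

# Invented toy numbers, only here to exercise the back-off path.
general_counts = {"<vblex><past>": 5, "<vblex><inf>": 2}

# The CG has already removed wound<n><sg>.
left_by_cg = ["wind<vblex><past>", "wound<vblex><pres>", "wound<vblex><inf>"]
seen = best_analysis("wound", left_by_cg, surface_counts, general_counts)

# "found" is not in the toy corpus vocabulary, so this backs off.
unseen = best_analysis(
    "found",
    ["find<vblex><past>", "found<vblex><inf>"],
    surface_counts,
    general_counts,
)
print(seen)
print(unseen)
```

With these numbers the first call picks wound<vblex><pres>, i.e. the
next most frequent analysis once the noun reading is gone, and the
second call falls back to the tag-only counts.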

Fran

------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
