Vladimir Yuryev wrote:

Hi, Andjej!

How you tested the Polish texts with what stemer?
Thanks,
Vladimir.


No reason to be too modest, Leo.. I tested your stemmer on English, Swedish and Polish texts (including F-measure vs. training set size plots), and it works exceptionally well indeed. Highly recommended!

Well, I have several corpora of Polish language, which together amount to roughly 90,000 words (nouns and verbs) having at least 4 inflected forms. This set is randomized (i.e. lines of words + forms are in random order). I've split this into two parts - one of a fixed size, as a test set, and one of variable size as a training set. Then I compile stemmer tables using variable number of training examples, and using differnt settings (trie, multi-trie, different optimizations, etc..). Then for each output table I test the precision/recall of correct base forms (lemmatization), and of ability to create unique stems (stemming). Finally, I select the "best" table, which gives reasonably good results vs. table size. To put it in plain terms, e.g. for tables roughly 300kB in size (created from training set of 3000 unique words + their forms) in best cases I get ~90% of correct stems, and ~70% of correct lemmas. Which is a _very_ good result!


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to