language identification training data

karl wettin Wed, 07 Mar 2007 23:12:22 -0800

Hello Nutch,

Perhaps you are interested in using my language detection training data harvester to support more languages with your current implementation. It downloads the Wikipedia article on the home country of each language to be trained, in every language that should be trained.
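
To make the download scheme concrete, here is a minimal sketch in Python. It is not the code attached to LUCENE-826; the TITLES table, the three languages in it, and the harvest_plan function are all hypothetical and only illustrate the idea. The sketch builds the list of pages to fetch and groups them by the language they are written in, so each language gets one article per home country.

```python
from urllib.parse import quote

# Hypothetical illustration table (an assumption, not taken from the harvester):
# outer key = language whose home country the article is about,
# inner key = Wikipedia language edition, value = article title in that edition.
TITLES = {
    "sv": {"sv": "Sverige", "en": "Sweden", "de": "Schweden"},
    "en": {"sv": "Storbritannien", "en": "United Kingdom", "de": "Vereinigtes Königreich"},
    "de": {"sv": "Tyskland", "en": "Germany", "de": "Deutschland"},
}


def harvest_plan(titles):
    """Return {wiki_lang: [url, ...]}, the pages to download per training language."""
    plan = {}
    for home_lang, by_wiki in titles.items():
        for wiki_lang, title in by_wiki.items():
            # Wikipedia page URLs use underscores for spaces; percent-encode the rest.
            path = quote(title.replace(" ", "_"))
            url = f"https://{wiki_lang}.wikipedia.org/wiki/{path}"
            plan.setdefault(wiki_lang, []).append(url)
    return plan


if __name__ == "__main__":
    for lang, urls in harvest_plan(TITLES).items():
        print(lang, len(urls), "articles")
        for url in urls:
            print("   ", url)
```

Because every language is trained on articles about the same set of countries, names such as "Sweden" or "Deutschland" show up in all of the training sets rather than in just one.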

As I'm not certain how feature selection is made on the training data with the Nutch language identifier, this may or may not be a good training set. In my implementation it means that the text to be classified can speak about common foreign named entities without giving a false positive, especially in short texts.


LUCENE-826

Currently 35 languages, mainly western/northern Proto-Indo-European and Finno-Ugric.


--
karl
