Hello Nutch,

perhaps you would be interested in using my language detection training data harvester to support more languages in your current implementation. For each language to be trained, it downloads the Wikipedia article about that language's home country, and it does so in every language being trained.
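To make the download scheme concrete, here is a minimal sketch of how the list of pages to fetch could be built. It is not the harvester's actual code: the class name HarvestPlan, the method plan, the country codes, the article titles, and the https://<lang>.wikipedia.org/wiki/<title> URL pattern are all assumptions chosen for illustration, and nothing is downloaded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one training corpus per language, covering every home country. */
public class HarvestPlan {

    /**
     * titles.get(lang).get(country) is the assumed title of the article about
     * `country` in the `lang` edition of Wikipedia. Returns, per language, the
     * URLs whose text would make up that language's training corpus.
     */
    static Map<String, List<String>> plan(Map<String, Map<String, String>> titles) {
        Map<String, List<String>> urlsByLanguage = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> edition : titles.entrySet()) {
            String lang = edition.getKey();
            List<String> urls = new ArrayList<>();
            for (String title : edition.getValue().values()) {
                // Wikipedia page names use underscores in place of spaces.
                // Assumption: titles are plain ASCII; others would need percent-encoding.
                urls.add("https://" + lang + ".wikipedia.org/wiki/" + title.replace(' ', '_'));
            }
            urlsByLanguage.put(lang, urls);
        }
        return urlsByLanguage;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> titles = new LinkedHashMap<>();

        // English edition: articles about the home countries of English and Swedish.
        Map<String, String> en = new LinkedHashMap<>();
        en.put("GB", "United Kingdom");
        en.put("SE", "Sweden");
        titles.put("en", en);

        // Swedish edition: the same two countries, under their Swedish titles.
        Map<String, String> sv = new LinkedHashMap<>();
        sv.put("GB", "Storbritannien");
        sv.put("SE", "Sverige");
        titles.put("sv", sv);

        for (Map.Entry<String, List<String>> entry : plan(titles).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

With N languages this yields N articles per language, so every training corpus talks about the same set of countries and differs mainly in the language itself.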

As I'm not certain how feature selection is done on the training data in the Nutch language identifier, this may or may not be a good training set. In my implementation it means that the text to be classified can mention common foreign named entities without producing a false positive, especially in short texts.

LUCENE-826

Currently 35 languages, mainly western/northern Indo-European and Finno-Ugric.

--
karl
