Hello Nutch,
perhaps you are interested in using my language detection training
data harvester to support more languages with your current
implementation. For each language to be trained, it downloads the
Wikipedia article about that language's home country, in every
language that should be trained.
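
The harvesting scheme described above can be sketched roughly as
follows. This is only an illustration in Python, not the code attached
to the issue: the LOCALIZED_TITLES table and the harvest_plan function
are hypothetical names, the three languages are sample data, and a real
harvester would have to look up the localized article titles (for
example through Wikipedia's interlanguage links) rather than hard-code
them. Nothing is downloaded here; the sketch only builds the list of
article URLs per language.

```python
from urllib.parse import quote

# Hypothetical sample data, not taken from the actual harvester:
# LOCALIZED_TITLES[home_language][wiki_language] is the title of the
# home-country article as it appears in that Wikipedia edition.
LOCALIZED_TITLES = {
    "sv": {"sv": "Sverige", "fi": "Ruotsi", "de": "Schweden"},
    "fi": {"sv": "Finland", "fi": "Suomi", "de": "Finnland"},
    "de": {"sv": "Tyskland", "fi": "Saksa", "de": "Deutschland"},
}


def harvest_plan(titles):
    """Map each language to the URLs of its training articles.

    Every language gets one article per home country, so the training
    text for a language also mentions the other countries by name.
    """
    plan = {}
    for wiki_lang in titles:
        plan[wiki_lang] = [
            # Percent-encode the title so non-ASCII titles stay valid.
            f"https://{wiki_lang}.wikipedia.org/wiki/{quote(by_wiki[wiki_lang])}"
            for by_wiki in titles.values()
        ]
    return plan


if __name__ == "__main__":
    for lang, urls in harvest_plan(LOCALIZED_TITLES).items():
        print(lang, urls)
```

With N languages this yields N articles per language, N*N in total, so
every language's training text covers the same set of countries.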
As I'm not certain how feature selection is done on the training data
in the Nutch language identifier, this may or may not be a good
training set. In my implementation it means that the text to be
classified can mention common foreign named entities without
giving a false positive, especially in short texts.
LUCENE-826
Currently 35 languages, mainly western/northern Indo-European
and Finno-Ugric.
--
karl