Hello Nutch, perhaps you are interested in using my language detection training data harvester to support more languages with your current implementation. For each language to be trained, it downloads the Wikipedia article about that language's home country, in every language that should be trained.
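To make the harvesting scheme concrete, here is a rough sketch of the download plan it implies. This is not the harvester's actual code: the `HOME_COUNTRY` table, the three sample languages, and the `harvest_plan` function are hypothetical names made up for illustration, and looking up each article's localized title and fetching it from Wikipedia are left out.

```python
from itertools import product

# Hypothetical sample table: language code -> its "home country" article.
# The real harvester covers more languages; these three are only examples.
HOME_COUNTRY = {"sv": "Sweden", "fr": "France", "de": "Germany"}


def harvest_plan(home_country):
    """Return (text_language, country) pairs to download.

    Every country article is requested in every training language, so each
    language's training text also mentions the other countries' names.
    """
    languages = sorted(home_country)
    countries = sorted(home_country.values())
    return list(product(languages, countries))


plan = harvest_plan(HOME_COUNTRY)

# Group the plan by training language to see what each training set contains.
training_sets = {}
for text_language, country in plan:
    training_sets.setdefault(text_language, []).append(country)
```

With three languages the plan holds nine articles, and the Swedish training set contains the articles about France, Germany and Sweden, all written in Swedish.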
As I'm not certain how the Nutch language identifier performs feature selection on the training data, this may or may not be a good training set for it. In my implementation, it means that the text to be classified can mention common foreign named entities without giving a false positive, especially in short texts. See LUCENE-826. It currently covers 35 languages, mainly western and northern Indo-European and Finno-Ugric.
-- karl