[Nutch-dev] language identification training data

karl wettin Wed, 07 Mar 2007 23:12:29 -0800

Hello Nutch,

Perhaps you would be interested in using my language detection training
data harvester to support more languages with your current
implementation. For each language to be trained, it downloads the
Wikipedia article about that language's home country, in every language
that should be trained.
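
The harvesting scheme described above amounts to a cross product: every home-country article is fetched from every language edition, so N languages give N x N documents. Here is a minimal Python sketch of that enumeration. It assumes a hypothetical mapping from language code to home country; the names HOME_COUNTRY and harvest_plan are illustrative and are not taken from LUCENE-826. It only lists the documents to fetch and does no downloading, since the article title differs per Wikipedia edition and would have to be resolved separately.

```python
# Hypothetical sample mapping: language code -> home country (English name).
HOME_COUNTRY = {"sv": "Sweden", "fi": "Finland", "de": "Germany"}


def harvest_plan(home_country):
    """Return (edition, country) pairs: each country article in each edition."""
    return [
        (edition, country)
        for country in home_country.values()  # article subject
        for edition in home_country.keys()    # Wikipedia language edition
    ]


if __name__ == "__main__":
    for edition, country in harvest_plan(HOME_COUNTRY):
        print(f"{edition}: article about {country}")
```

Because every language sees text about the same set of countries, names such as "Sweden" or "Finland" occur in all training sets rather than in just one of them.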


As I'm not certain how feature selection is done on the training data
in the Nutch language identifier, this may or may not be a good
training set. In my implementation it means that the text to be
classified can mention common foreign named entities without
giving a false positive, especially in short texts.

LUCENE-826

Currently 35 languages, mainly western/northern Indo-European
and Finno-Ugric.

-- 
karl

