Does anyone know of a good language detection library that can detect what language a document (text) is in?
Language detection is easy. It's just a simple text classification problem.

One way you can do this is using Lucene itself. Create a so-called pseudo-document for each language consisting of lots of text (1 MB or more, ideally). Then build a Lucene index using a character n-gram tokenizer. E.g., with 2-grams, "John Smith" tokenizes to "Jo", "oh", "hn", "n ", " S", "Sm", "mi", "it", "th". You'll have to make sure to index beyond the first 1000 tokens, or whatever limit Lucene is set to by default.

To do language ID, just treat the text to be identified as the basis of a query. Parse it using the same character n-gram tokenizer. The highest-scoring result is the answer, and if two languages score high, you know there may be some ambiguity. You can't trust Lucene's normalized scoring for rejection, though.

Make sure the tokenizer includes spaces as well as non-space characters (though runs of whitespace may be normalized to a single space). Using more orders (1-grams, 2-grams, 3-grams, etc.) gives more accuracy; the IDF weighting is quite sensible here and will work out the details for the counts for you.

For a more sophisticated approach, check out LingPipe's language ID tutorial, which is based on probabilistic character language models. Think of it as similar to the Lucene model but with different term weighting.

http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

Here's accuracy vs. input length on a set of 15 languages from the Leipzig Corpus collection (just one of the many evals in the tutorial):

#chars  accuracy
     1    22.59%
     2    34.82%
     4    58.55%
     8    81.17%
    16    92.45%
    32    97.33%
    64    98.99%
   128    99.67%

The end of the tutorial has references to other popular language ID packages online (e.g. TextCat, which is Gertjan van Noord's Perl package). It also has references to the technical background on TF/IDF classification with n-grams and character language models.
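To make the pseudo-document idea concrete without setting up an index, here is a minimal Python sketch of the same scheme: character n-grams (spaces included) over one pseudo-document per language, scored with a simple TF-IDF-style weighting. This is an illustration only. The names char_ngrams, NgramLanguageGuesser and SAMPLES are made up for this example, the weighting is only loosely modelled on TF-IDF and is not Lucene's actual scoring formula, and the one-sentence samples stand in for the 1 MB or more of text per language you would really want.

```python
import math
import re
from collections import Counter


def char_ngrams(text, min_n=1, max_n=3):
    """Return every character n-gram of orders min_n..max_n, spaces included."""
    text = re.sub(r"\s+", " ", text)  # collapse whitespace runs to one space
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


class NgramLanguageGuesser:
    """Toy TF-IDF-style scorer over one pseudo-document per language."""

    def __init__(self, samples, min_n=1, max_n=3):
        self.min_n, self.max_n = min_n, max_n
        # Term frequencies of each n-gram in each language's pseudo-document.
        self.counts = {lang: Counter(char_ngrams(text, min_n, max_n))
                       for lang, text in samples.items()}
        # Document frequency: number of pseudo-documents containing the n-gram.
        doc_freq = Counter(g for c in self.counts.values() for g in c)
        n_docs = len(self.counts)
        self.idf = {g: 1.0 + math.log((n_docs + 1) / (df + 1))
                    for g, df in doc_freq.items()}
        # Length normalization so a longer pseudo-document does not win by size.
        self.norms = {lang: math.sqrt(sum((math.sqrt(tf) * self.idf[g]) ** 2
                                          for g, tf in c.items()))
                      for lang, c in self.counts.items()}

    def rank(self, text):
        """Return (language, score) pairs, best match first."""
        query = Counter(char_ngrams(text, self.min_n, self.max_n))
        scores = {}
        for lang, c in self.counts.items():
            dot = sum(qtf * self.idf[g] * math.sqrt(c[g]) * self.idf[g]
                      for g, qtf in query.items() if g in c)
            scores[lang] = dot / self.norms[lang] if self.norms[lang] else 0.0
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy training data; real pseudo-documents should be far larger.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt hinweg und der faule hund "
          "und die katze sitzen auf der matte",
}

if __name__ == "__main__":
    guesser = NgramLanguageGuesser(SAMPLES)
    print(guesser.rank("the cat and the dog"))
```

As in the Lucene version, the top entry of rank() is the guess, and a small gap between the first two scores is a hint that the input is ambiguous. The raw scores are not calibrated, so they should not be used as a rejection threshold on their own.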
- Bob Carpenter
  Alias-i