Michael Baessler wrote:
Hi,

I'm one of the Apache UIMA committers and while searching for an open source language detection technology I found the
Nutch LanguageIdentifierPlugin.


Hello Michael,


Now my questions:
Is there a place where I can find some more details about how your language identification works?

It uses character n-gram models of different languages, i.e. histograms of relative frequencies of character groups. It builds a similar model for the text under examination, and then compares its model to other pre-defined models. The best match wins. This method is described in a paper by Cavnar and Trenkle (http://citeseer.ist.psu.edu/68861.html).

This works very well even for short texts, and doesn't require any linguistic knowledge. However, it works poorly on texts that mix sections in different languages, texts in a language with no pre-built model, or extremely short texts.
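The idea above can be sketched in a few lines. This is not the Nutch plugin's actual code, just a minimal illustration of the Cavnar-Trenkle "out-of-place" measure: build a ranked profile of the most frequent character n-grams, then score a text against each language profile by summing rank differences (with a maximum penalty for n-grams absent from the language profile). All names here are invented for the sketch.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Build a ranked profile of the most frequent character 1..n_max-grams."""
    text = " ".join(text.lower().split())  # normalize whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the ranking of the top_k n-grams; frequencies are discarded.
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place_distance(doc_profile, lang_profile):
    """Cavnar-Trenkle distance: sum of rank differences between profiles."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # penalty for n-grams not in the model
    total = 0
    for rank, gram in enumerate(doc_profile):
        total += abs(rank - ranks[gram]) if gram in ranks else max_penalty
    return total

def identify(text, models):
    """Return the language whose pre-built profile best matches the text."""
    profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place_distance(profile, models[lang]))
```

In practice the language models are built once from large training corpora and stored; classification then only needs the profile of the incoming text and one distance computation per language.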


Will it be possible to share the language identification technology so that we can wrap it as a UIMA analysis engine? My current understanding is that it is only available within Nutch, not separately.

There is a grass-roots effort underway to extract portions of Nutch related to content parsing into a separate framework, called Tika. Jukka Zitting and Chris Mattmann would be the right people to talk to.


Since both projects are hosted on Apache, I don't see any license issues when using your technology. :-)

Neither do I. AFAIK, ASF encourages maximum re-use of Apache components over external ones.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com