Hi Thilo,

2010/11/29 Thilo Götz <[email protected]>

> Hi Tommaso,
>
> do you know what algorithm Tika uses for language identification?
>

Tika uses a collection of existing language profiles. A language profile is then built from the text to analyze, and the existing profile with the lowest distance from that content-generated profile is taken as the language of the analyzed text. See [1].

> I'm wondering how well it does.  I'm very much in favor of having
> an out-of-the-box language ID annotator for UIMA.
>

:-) That was also my idea when I proposed it. Since many different algorithms exist for language identification, maybe such a component's role would be to aggregate different algorithms that do the same thing in different ways (also providing an extension point).

Jörn pointed out that we can put lang id algorithms in projects we already have, which is also a reasonable approach.

So the 3 algorithms in place can go inside a single component (I called it SimpleLanguageAnnotator) or alternatively into different projects (TikaAnnotator, AlchemyAPIAnnotator and DictionaryAnnotator, plus Jörn said he can do it inside OpenNLP).

Whichever we choose, in my opinion it'd be a good idea to put some notes about language identification with UIMA on the website.

Regards,
Tommaso

[1] : http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/java/org/apache/tika/language/
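
P.S. To make the profile-distance idea above more concrete, here is a rough sketch in Python. It is not Tika's code: the use of character trigrams, the Euclidean distance, the function names and the tiny reference samples are all assumptions made for illustration, and real profiles would be built from much larger corpora.

```python
import math
from collections import Counter


def build_profile(text, n=3):
    """Return relative frequencies of character n-grams in text (assumed: trigrams)."""
    normalized = " ".join(text.lower().split())
    grams = Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def distance(a, b):
    """Euclidean distance between two n-gram frequency profiles (assumed metric)."""
    return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in set(a) | set(b)))


def identify(text, reference_profiles):
    """Return the language whose reference profile is closest to the text's profile."""
    profile = build_profile(text)
    return min(reference_profiles, key=lambda lang: distance(profile, reference_profiles[lang]))


# Hypothetical toy reference profiles, built from one sentence per language.
REFERENCE_PROFILES = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": build_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
    "it": build_profile("la volpe veloce salta sopra il cane pigro e il gatto"),
}

if __name__ == "__main__":
    print(identify("the dog and the fox", REFERENCE_PROFILES))
```

An aggregating component like the one discussed above could call several such identifiers and combine their answers, but that part is left out here.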
