Hi Thilo,

2010/11/29 Thilo Götz <[email protected]>
>
>
> Hi Tommaso,
>
> do you know what algorithm Tika uses for language identification?
>

Tika ships with a collection of existing language profiles. It builds a
language profile from the text to analyze, and the existing profile with the
lowest distance from that content-generated profile is taken as the language
of the analyzed text. See [1].
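
To make the idea concrete, here is a minimal sketch of the "closest profile
wins" approach. It is not Tika's code: I am assuming character-trigram
frequency profiles and a Euclidean distance purely for illustration, and the
class and method names (ProfileDistanceSketch, profile, distance, identify)
are made up for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; names and distance measure are assumptions.
public class ProfileDistanceSketch {

    /** Builds a relative-frequency profile of character trigrams. */
    static Map<String, Double> profile(String text) {
        String s = text.toLowerCase();
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1.0, Double::sum);
            total++;
        }
        // Normalize counts so that texts of different length are comparable.
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return counts;
    }

    /** Euclidean distance between two profiles (assumed measure). */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double d = e.getValue() - b.getOrDefault(e.getKey(), 0.0);
            sum += d * d;
        }
        // Trigrams that occur only in b contribute their full frequency.
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue() * e.getValue();
            }
        }
        return Math.sqrt(sum);
    }

    /** Returns the key of the known profile closest to the text's profile. */
    static String identify(String text, Map<String, Map<String, Double>> known) {
        Map<String, Double> p = profile(text);
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> e : known.entrySet()) {
            double d = distance(p, e.getValue());
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles built from one sentence each; real ones need far more text.
        Map<String, Map<String, Double>> known = new LinkedHashMap<>();
        known.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        known.put("de", profile("der schnelle braune fuchs springt auf den faulen hund und die katze"));
        System.out.println(identify("the dog and the fox", known));
    }
}
```

A real set of profiles would be trained on much larger corpora than the toy
sentences above, and short input texts will generally be harder to classify
than long ones.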


> I'm wondering how well it does.  I'm very much in favor of having
> an out-of-the-box language ID annotator for UIMA.
>
>
:-)
That was my idea too when I proposed it. Since many different algorithms
exist for language identification, such a component could aggregate several
algorithms that do the same thing in different ways (and also provide an
extension point).
Jörn pointed out that we could instead put the language identification
algorithms in the projects we already have, which is also a reasonable
approach.
So the 3 algorithms we have can either go inside a single component (I
called it SimpleLanguageAnnotator) or be split across different projects
(TikaAnnotator, AlchemyAPIAnnotator and DictionaryAnnotator; Jörn also said
he can do it inside OpenNLP).

Whichever we choose, in my opinion it would be a good idea to put some notes
about language identification with UIMA on the website.
Regards,
Tommaso

[1]: http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/java/org/apache/tika/language/
