Re: [jira] [Created] (TIKA-1405) German content detected as French

Oleg Tikhonov Sat, 30 Aug 2014 05:09:01 -0700

Hi,
Does the content contain only one language, or is it mixed?
If the text contains a single language, then something seems wrong with
our language profiles. If it is mixed, then the result is kind of OK: the
first language detected will be the answer.


What is the size of the content: one word or a larger block of text? In
general, language detection is more precise on large texts than on small
ones.
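
To illustrate why the size matters, here is a minimal sketch of
character-trigram profiling. It is a toy illustration under my own
assumptions, not Tika's actual implementation: the class TrigramSketch and
the methods profile and similarity are hypothetical names, and cosine
similarity is just one possible way to compare profiles.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy sketch only: hypothetical names, not part of the Tika API.
public class TrigramSketch {

    // Count letter trigrams in the text. Runs of non-letters collapse into a
    // single '_' word boundary, and the text is lower-cased so that case does
    // not change the profile (a choice made for this sketch).
    static Map<String, Integer> profile(String text) {
        StringBuilder normalized = new StringBuilder("_");
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetter(c)) {
                normalized.append(c);
            } else if (normalized.charAt(normalized.length() - 1) != '_') {
                normalized.append('_');
            }
        }
        if (normalized.charAt(normalized.length() - 1) != '_') {
            normalized.append('_');
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram profiles, in the range [0, 1].
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty profile carries no evidence at all
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        String word = "Haus";
        String sentence = "Das Haus steht am Ende der langen Strasse";
        // A single word yields only a handful of trigrams to compare against
        // a reference profile; a sentence yields many more.
        System.out.println(word + ": " + profile(word).size() + " distinct trigrams");
        System.out.println(sentence + ": " + profile(sentence).size() + " distinct trigrams");
    }
}
```

With only a few trigrams as evidence, a short input can match the profiles
of several languages about equally well, which is why I am asking about the
size of the text.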

Best regards,
Oleg


On Sat, Aug 30, 2014 at 1:13 PM, Zaheer Beig (JIRA) <[email protected]> wrote:

> Zaheer Beig created TIKA-1405:
> ---------------------------------
>
>              Summary: German content detected as French
>                  Key: TIKA-1405
>                  URL: https://issues.apache.org/jira/browse/TIKA-1405
>              Project: Tika
>           Issue Type: Bug
>           Components: languageidentifier
>     Affects Versions: 1.4
>          Environment: Linux
>             Reporter: Zaheer Beig
>
>
> Hi,
> We are using Apache Tika 1.4 for document conversion to text and language
> detection in one of our projects. We are facing the issues below with
> language detection:
>
> 1. When the text is in all UPPER CASE, even though the language is
> English, it gets detected as Estonian.
> 2. For much of our German content, the language gets detected as French
> (though this is not the case for all German content).
>
> Any update on this will be very helpful.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
