Hi,

Does the content contain only one language, or is it mixed? If the text contains a single language, then something seems wrong in our language profiles. If it is mixed, then it is kind of OK: the first language detected will be the answer.
What is the size of the text: one word or a larger body of text? Detecting the language of a large text is generally more precise than detecting it on a small one.

Best regards,
Oleg

On Sat, Aug 30, 2014 at 1:13 PM, Zaheer Beig (JIRA) <[email protected]> wrote:
> Zaheer Beig created TIKA-1405:
> ---------------------------------
>
>              Summary: German content detected as French
>                  Key: TIKA-1405
>                  URL: https://issues.apache.org/jira/browse/TIKA-1405
>              Project: Tika
>           Issue Type: Bug
>           Components: languageidentifier
>     Affects Versions: 1.4
>          Environment: Linux
>             Reporter: Zaheer Beig
>
>
> Hi,
> We are using Apache Tika 1.4 for document conversion to text and language
> detection in one of our project. We are facing below issues with language
> detection:
>
> 1. When the text is in all UPPER CASE, even though the language is
> English, it gets detected as Estonian.
> 2. For many of our German content , language gets detected as French
> [Though this is not the case for all German content]
>
> Any update on this will be very helpful.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
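
To illustrate the point above about text size, here is a minimal sketch in plain Java. It is not Tika's implementation; it only assumes that the detector compares letter n-gram statistics of the input against per-language profiles. The class `TrigramDemo` and the method `trigramCounts` are hypothetical names made up for this example. The sketch shows how few trigrams a single word provides compared with a full sentence, and it lower-cases the input so that all-upper-case text produces the same trigrams as normal text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Apache Tika.
class TrigramDemo {

    // Count letter trigrams per word. Each word is padded with '_' so that
    // word starts and ends also contribute trigrams. The input is
    // lower-cased first, so "HALLO" and "hallo" give identical counts.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on any run of non-letter characters.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // a leading separator produces an empty first token
            }
            String padded = "_" + word + "_";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String shortText = "Hallo";
        String longText = "Wir verwenden Apache Tika zur Umwandlung von "
                + "Dokumenten in Text und zur Spracherkennung";

        // A single word yields only a handful of trigrams, so any profile
        // comparison rests on very little evidence.
        System.out.println("short: " + trigramCounts(shortText).size()
                + " distinct trigrams");
        System.out.println("long:  " + trigramCounts(longText).size()
                + " distinct trigrams");
    }
}
```

With so few trigrams in a short input, a small overlap with another language's profile can be enough to tip the result, which is one plausible reason short or unusual input gets misdetected.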
