[
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580250#comment-13580250
]
Pander Musubi commented on TIKA-369:
------------------------------------
I know someone from another community who has created a Java servlet around
https://code.google.com/p/language-detection and it will be submitted back to
that project. At the moment he is making some improvements to an already
functioning version, but he could use some extra hands. If anybody is
interested in his current version, which is in a Git repo, please contact me
and I will introduce you to each other.
> Improve accuracy of language detection
> --------------------------------------
>
> Key: TIKA-369
> URL: https://issues.apache.org/jira/browse/TIKA-369
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 0.6
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf,
> textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper
> (attached) indicates that a log-likelihood ratio (LLR) test works much
> better, which would then make language detection faster due to less text
> needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact
> value as a threshold for certainty. This is very sensitive to the amount of
> text being processed, and thus gives false negative results for short runs of
> text.
> 3. Certainty should also be based on how much better the result is for
> language X, compared to the next best language. If two languages both had
> identical sum-of-squares values, and this value was below the threshold, then
> the result is still not very certain.
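
The three points in the description can be illustrated with a small sketch. It is not Tika code: the function names, the distance values, and the thresholds (max_distance, min_margin) are hypothetical and chosen only for illustration. The first function computes the log-likelihood ratio statistic (G^2) for a 2x2 contingency table, in the spirit of the attached "Surprise and Coincidence" paper. The second shows one possible certainty check that considers the gap to the runner-up language in addition to an absolute threshold, assuming that a lower distance means a better match.

```python
import math


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table.

    For language detection, one could (as an assumption) take k11 as the
    count of an n-gram in the text, k12 as the count of all other n-grams
    in the text, and k21/k22 as the same counts in a language profile.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    score = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a zero cell contributes nothing (0 * ln 0 := 0)
            score += observed * math.log(observed * total / (row * col))
    return 2.0 * score


def is_reasonably_certain(distances, max_distance, min_margin):
    """Return (best_language, certain) for a dict of language -> distance.

    Hypothetical check: the best distance must be under max_distance AND
    the relative gap to the runner-up must be at least min_margin.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    best_lang, best = ranked[0]
    if best > max_distance:
        return best_lang, False
    if len(ranked) == 1:
        return best_lang, True
    runner_up = ranked[1][1]
    # Relative gap between the best and second-best language; 0 on a tie.
    margin = (runner_up - best) / runner_up if runner_up > 0 else 0.0
    return best_lang, margin >= min_margin


# Two languages with identical distances under the threshold: not certain.
print(is_reasonably_certain({"en": 0.01, "de": 0.01}, 0.022, 0.2))
# A clear gap to the runner-up: certain.
print(is_reasonably_certain({"en": 0.01, "de": 0.02}, 0.022, 0.2))
```

A relative margin is used here instead of an absolute one so that the check depends less on the amount of text processed, which is the concern raised in point 2. Whether that holds in practice would need to be measured.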