[jira] [Commented] (TIKA-369) Improve accuracy of language detection

Ken Krugler (JIRA) Tue, 19 Feb 2013 06:49:19 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581315#comment-13581315
 ]


Ken Krugler commented on TIKA-369:
----------------------------------

Some questions then about integrating language-detection:

1. Do we care about thread safety?

If yes, then I think we'd either need to maintain our own version of the 
library, or get some fixes rolled into the upstream project.

2. How much control over settings do we need?

For example: specifying the set of supported languages, assigning a priori 
language probabilities, capping the maximum text length, and so on.
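
To make question 2 concrete, here is a rough sketch of what such a settings 
object might look like. The class DetectionSettings and all of its members are 
hypothetical names made up for illustration; they are not part of Tika or of 
the language-detection library. It also assumes that priors are given as 
non-negative weights which get normalized, and that languages without an 
explicit weight default to 1.0.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical settings holder; not an existing Tika or language-detection class.
public final class DetectionSettings {
    private final Set<String> languages;
    private final Map<String, Double> priors;
    private final int maxTextLength;

    public DetectionSettings(Set<String> languages,
                             Map<String, Double> priorWeights,
                             int maxTextLength) {
        if (languages.isEmpty()) {
            throw new IllegalArgumentException("at least one language is required");
        }
        if (maxTextLength <= 0) {
            throw new IllegalArgumentException("maxTextLength must be positive");
        }
        this.languages = Collections.unmodifiableSet(new LinkedHashSet<>(languages));
        this.maxTextLength = maxTextLength;

        // Assumption: languages without an explicit weight get weight 1.0.
        Map<String, Double> normalized = new LinkedHashMap<>();
        double total = 0.0;
        for (String lang : this.languages) {
            double weight = priorWeights.getOrDefault(lang, 1.0);
            if (weight < 0.0) {
                throw new IllegalArgumentException("negative prior for " + lang);
            }
            normalized.put(lang, weight);
            total += weight;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("priors must not all be zero");
        }
        // Scale the weights so the a priori probabilities sum to 1.
        for (Map.Entry<String, Double> entry : normalized.entrySet()) {
            entry.setValue(entry.getValue() / total);
        }
        this.priors = Collections.unmodifiableMap(normalized);
    }

    public Set<String> getLanguages() {
        return languages;
    }

    // A priori probability for a supported language, or 0.0 if unsupported.
    public double getPrior(String language) {
        return priors.getOrDefault(language, 0.0);
    }

    public int getMaxTextLength() {
        return maxTextLength;
    }

    // Limits the text handed to the detector to maxTextLength characters.
    public String truncate(String text) {
        return text.length() <= maxTextLength ? text : text.substring(0, maxTextLength);
    }
}
```

Keeping the object immutable would also sidestep part of the thread-safety 
question, since one instance could be shared across threads without locking.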

If neither is an issue, then I could roll this in pretty quickly.

                
> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, 
> textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language 
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper 
> (attached) indicates that a log-likelihood ratio (LLR) test works much 
> better, which would also make language detection faster because less text 
> would need to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses a fixed 
> value as a threshold for certainty. This is very sensitive to the amount of 
> text being processed, and thus gives false negatives for short runs of 
> text.
> 3. Certainty should also be based on how much better the result is for 
> language X, compared to the next best language. If two languages have 
> identical sum-of-squares values, and that value is below the threshold, then 
> the result is still not very certain.
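
For reference, points 1 and 3 of the issue description can be sketched as 
follows. This is only an illustration, not the current Tika code and not a 
proposed patch: the class LlrSketch, its method names, and the relative-gap 
rule are assumptions made up here. The LLR part follows the usual 2x2 
contingency-table form of the G^2 statistic from Dunning's paper, where k11 
and k12 would be the counts of one n-gram and of all other n-grams in the 
text, and k21 and k22 the same counts in a language profile. The certainty 
check assumes that a lower distance means a better match.

```java
// Hypothetical helper; names and thresholds are illustrative only.
public final class LlrSketch {

    private LlrSketch() {
    }

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(k ln k).
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long count : counts) {
            sum += count;
            parts += xLogX(count);
        }
        return xLogX(sum) - parts;
    }

    // G^2 statistic for a 2x2 contingency table of counts.
    // Values near 0 mean the rows look alike; large values mean they differ.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Certain only if the best distance is under the threshold AND the
    // runner-up is worse by more than the given fraction of its own distance.
    public static boolean isReasonablyCertain(double bestDistance,
                                              double runnerUpDistance,
                                              double maxDistance,
                                              double minRelativeGap) {
        if (bestDistance > maxDistance) {
            return false;
        }
        return (runnerUpDistance - bestDistance) > minRelativeGap * runnerUpDistance;
    }
}
```

With this rule, two languages with identical distances are never reported as 
certain, no matter how far below the threshold they both fall, which is the 
case point 3 describes.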
