[ 
https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966974#action_12966974
 ] 

Ken Krugler commented on TIKA-568:
----------------------------------

OK, thanks for prodding me on this one. I made the same change locally (added 
a getDistance() method). I also changed the algorithm to decide whether a 
match is close enough using the relative delta between the best hit and the 
second-best hit, rather than an absolute value, which improved things.
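
For illustration only, here is a minimal sketch of the relative-delta idea. 
It is not the local change or the Tika code: the class name, method signature, 
and threshold parameter are hypothetical, and it assumes each candidate 
language has a distance where smaller means a closer match.

```java
public final class RelativeDeltaSketch {
    private RelativeDeltaSketch() {}

    /**
     * Hypothetical helper: returns true when the best (smallest) distance
     * beats the second-best distance by at least minRelativeDelta, measured
     * relative to the second-best distance rather than as an absolute gap.
     */
    public static boolean isReasonablyCertain(double[] distances,
                                              double minRelativeDelta) {
        if (distances.length < 2) {
            throw new IllegalArgumentException("need at least two candidates");
        }
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        // Single pass to find the two smallest distances.
        for (double d : distances) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        if (second <= 0.0) {
            return false; // best and second-best are both zero: a tie
        }
        return (second - best) / second >= minRelativeDelta;
    }
}
```

With this shape the caller picks the threshold, so a gap of 0.010 versus 
0.020 counts as decisive while 0.019 versus 0.020 does not, regardless of 
the absolute scale of the distances.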

But as Jan mentioned, the real issue is that the algorithm currently used has 
some significant shortcomings.


> Language Detection isReasonablyCertain() hides valuable information
> -------------------------------------------------------------------
>
>                 Key: TIKA-568
>                 URL: https://issues.apache.org/jira/browse/TIKA-568
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: TIKA-568.patch
>
>
> LanguageIdentifier.isReasonablyCertain() hardcodes a threshold for language 
> detection, which is fine, except applications should be allowed to decide 
> what threshold suits them.  For instance, how was 0.022 decided?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
