[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886096#comment-16886096 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

I'm attaching a final "out of the box" comparison against the 103 langs in the standard distro of OpenNLP. For each detector, the scoring only considers the languages that the detector claims it can identify. Nevertheless, of course, if one detector has models for 200 languages, and another is targeted to the 103 langs specifically, there will be some...differences...ymmv.

The custom, probing OpenNLP-based lang detector for tika-eval uses a model with 121 languages. The comparison shows that the tika-eval lang id's performance does not degrade on long pieces of text the way OpenNLP 1.9.1's lang detector does, and that it is far faster than OpenNLP 1.9.1.

There are two areas for improvement in the custom tika-eval detector: short text and noisy text -- Optimaize is still much better on both -- although, to be fair, Optimaize has fewer language models.

Yalder, of course, is still the fastest, by far.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png,
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip,
> langid_20190514_plus_minus_1.zip, rollups_20190716.zip, timeVsLength.png
>
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)