[jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp

Tim Allison (JIRA) Tue, 16 Jul 2019 05:59:48 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886096#comment-16886096
 ]


Tim Allison commented on TIKA-2790:
-----------------------------------

I'm attaching a final "out of the box" comparison against the 103 langs in the 
standard distro of OpenNLP.  A language counts toward a detector's score only 
if that detector claims it can identify that language.  Even so, if one 
detector has models for 200 languages and another targets the 103 langs 
specifically, there will be some...differences...ymmv.  
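
For readers who want the scoring rule spelled out, here is a minimal sketch in 
Python.  It is not the evaluation code from the attachments; the function name 
`accuracy_on_supported`, the argument layout, and the use of plain accuracy as 
the metric are all assumptions made for illustration.

```python
def accuracy_on_supported(gold_langs, predicted_langs, supported_langs):
    """Accuracy over only those samples whose gold language the detector
    claims to support (hypothetical helper, not from tika-eval).

    gold_langs      -- list of true language codes, one per sample
    predicted_langs -- list of codes the detector returned, same order
    supported_langs -- set of codes the detector claims it can identify
    """
    scored = 0
    correct = 0
    for gold, predicted in zip(gold_langs, predicted_langs):
        if gold not in supported_langs:
            # The detector never claimed this language, so the sample
            # is left out of its score entirely.
            continue
        scored += 1
        if predicted == gold:
            correct += 1
    # None signals "nothing to score" rather than a misleading 0.0.
    return correct / scored if scored else None
```

Note that a detector with more models can still be penalized under this rule: 
a wrong guess drawn from its larger label set counts as an error on any sample 
whose gold language it does support.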

The custom, probing OpenNLP-based lang detector for tika-eval uses a model with 
121 languages.

This shows that the tika-eval lang id's performance does not degrade on long 
pieces of text the way OpenNLP 1.9.1's lang detector does, and that it is far 
faster than OpenNLP 1.9.1.  There are two areas for improvement in the custom 
tika-eval detector: short text and noisy text.  Optimaize is still much better 
on both, although, to be fair, Optimaize has fewer language models.  Yalder, 
of course, is still the fastest, by far.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, rollups_20190716.zip, timeVsLength.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
