[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837587#comment-16837587
]
Tim Allison commented on TIKA-2790:
-----------------------------------
[^langid_20190510.zip] These are the results with the current methodology.
There is a big table file that records every output, and an aggregations file
with various rollups of the data. My basic takeaways haven't changed much
since the initial run; that said, it is better to have multiple samples per
length/noise/language tuple.
[~kkrugler], please let me know if I've misused yalder or if there are better
ways to run it. It is by far the fastest, but it isn't very robust against the
noise I added...we may want to revisit that part of the methodology.
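For reference, the noise injection was roughly along these lines (a minimal sketch only; the substitution rate and the replacement pool of lowercase ASCII letters are illustrative assumptions, not the exact values from the attached run):

```java
import java.util.Random;

// Hypothetical sketch of character-level noise injection: replaces a given
// fraction of characters with random lowercase ASCII letters. The rate and
// replacement pool are illustrative, not the exact procedure from the run.
public class Noiser {
    public static String addNoise(String text, double rate, long seed) {
        Random rnd = new Random(seed); // fixed seed for repeatable samples
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            if (rnd.nextDouble() < rate) {
                // swap this character for a random lowercase letter
                sb.append((char) ('a' + rnd.nextInt(26)));
            } else {
                sb.append(text.charAt(i));
            }
        }
        return sb.toString();
    }
}
```

A per-character substitution like this is harsher on detectors that rely on exact word lookups than on pure n-gram models, which may explain part of the robustness gap.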
I understand that putting a huge block of text into OpenNLP is a bad idea
(thank you [~joern]), but I can't for the life of me figure out why going from
10,000 to 100,000 would drop accuracy from 0.99 to 0.57... unless there's a bug
in
my code?! Note, too, that OpenNLP gets _better_ with more noise on 100k blocks
of text.
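One thing worth checking on the 100k case is whether detecting per 10k chunk and taking a majority vote recovers the 10k-level accuracy. A minimal sketch, with the detector abstracted as a plain function (in tika-eval it would wrap OpenNLP's LanguageDetectorME; the "und" fallback for empty input is my own assumption):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: split a long text into fixed-size chunks, detect a language per
// chunk, and take a majority vote across chunks.
public class ChunkedDetect {
    public static List<String> chunk(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(i + size, text.length())));
        }
        return chunks;
    }

    public static String majorityLanguage(String text, int chunkSize,
                                          Function<String, String> detect) {
        Map<String, Integer> votes = new HashMap<>();
        for (String c : chunk(text, chunkSize)) {
            votes.merge(detect.apply(c), 1, Integer::sum);
        }
        // pick the language with the most chunk-level votes
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("und");
    }
}
```

If chunked voting scores ~0.99 where the single 100k call scores 0.57, that would point at something length-dependent inside the detector rather than at the eval harness.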
This is all still tentative, but please do take a look and let me know what you
find.
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: langid_20190509.zip, langid_20190510.zip
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)