[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837587#comment-16837587
]
Tim Allison commented on TIKA-2790:
-----------------------------------
[^langid_20190510.zip] These are the results with the current methodology.
There is a big table file that records every output, and an aggregations file
with various rollups of the data. My basic takeaways haven't changed much
since the initial run; that said, it is better to have multiple samples per
length/noise/language tuple.
[~kkrugler], please let me know if I've misused yalder or if there are better
ways to run it. It is by far the fastest, but it isn't very robust against the
noise I added...we may want to revisit that part of the methodology.
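For reference, the noise injection was roughly along these lines (a minimal sketch only; the substitution rate and the replacement pool of lowercase ASCII letters are illustrative assumptions, not the exact values from the attached run):

```java
import java.util.Random;

// Hypothetical sketch of character-level noise injection: replaces a given
// fraction of characters with random lowercase ASCII letters. The rate and
// replacement pool are illustrative, not the exact procedure from the run.
public class Noiser {
    public static String addNoise(String text, double rate, long seed) {
        Random rnd = new Random(seed); // fixed seed for repeatable samples
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            if (rnd.nextDouble() < rate) {
                // swap this character for a random lowercase letter
                sb.append((char) ('a' + rnd.nextInt(26)));
            } else {
                sb.append(text.charAt(i));
            }
        }
        return sb.toString();
    }
}
```

A per-character substitution like this is harsher on detectors that rely on exact word lookups than on pure n-gram models, which may explain part of the robustness gap.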
I understand that putting a huge block of text into OpenNLP is a bad idea
(thank you [~joern]), but I can't for the life of me figure out why going from
10,000 to 100,000 would drop accuracy from 0.99 to 0.57... unless there's a bug
in
my code?! Note, too, that OpenNLP gets _better_ with more noise on 100k blocks
of text.
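One thing worth checking on the 100k case is whether detecting per 10k chunk and taking a majority vote recovers the 10k-level accuracy. A minimal sketch, with the detector abstracted as a plain function (in tika-eval it would wrap OpenNLP's LanguageDetectorME; the "und" fallback for empty input is my own assumption):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: split a long text into fixed-size chunks, detect a language per
// chunk, and take a majority vote across chunks.
public class ChunkedDetect {
    public static List<String> chunk(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(i + size, text.length())));
        }
        return chunks;
    }

    public static String majorityLanguage(String text, int chunkSize,
                                          Function<String, String> detect) {
        Map<String, Integer> votes = new HashMap<>();
        for (String c : chunk(text, chunkSize)) {
            votes.merge(detect.apply(c), 1, Integer::sum);
        }
        // pick the language with the most chunk-level votes
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("und");
    }
}
```

If chunked voting scores ~0.99 where the single 100k call scores 0.57, that would point at something length-dependent inside the detector rather than at the eval harness.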
This is all still tentative, but please do take a look and let me know what you
find.
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: langid_20190509.zip, langid_20190510.zip
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)