[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872418#comment-16872418 ]
Tim Allison commented on TIKA-2790:
-----------------------------------

bq. For an apples-to-apples comparison with OpenNLP, I guess you'd have to load the same 103 language models that they support (or some intersection of the same?)

Y. Absolutely. The initial comparison was "out of the box"* ... apples to oranges.

*With the one exception that I loaded all of Yalder's languages, including the extras.

I wanted to see, initially, what happens if we take the packages off the shelf. I agree that it would be better to do a follow-on apples-to-apples comparison. :)

bq. as yalder is slower than Optimaize & OpenNLP when early termination is disabled,

This has been puzzling me as well. My _guess_ is that Yalder is updating the stats with every new known ngram, rather than batching counts. But there may very well be something else going on, including the 2x number of languages that Yalder was handling!

bq. and even slower on short text with early termination

I'd want to do quite a bit more benchmarking on short texts to confirm this generally. I worry about micro-benchmarking pitfalls. I am more comfortable with the results on longer chunks of text.

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip, timeVsLength.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
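To make the "updating stats per ngram vs. batching counts" guess concrete, here is a hypothetical Java sketch of the two scoring styles. This is not Yalder's actual code; the `NgramScoring` class, the toy trigram model, and its log-probability values are all invented for illustration. The point is that the batched variant does one model lookup per *distinct* ngram instead of one per ngram occurrence, which matters when the same ngrams repeat across long text and many language models.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramScoring {
    // Toy per-language trigram log-probabilities (hypothetical values).
    static final Map<String, Double> MODEL = Map.of("the", -1.0, "he ", -1.5, "e q", -3.0);

    // Incremental style: touch the model and update the score on every ngram occurrence.
    static double scoreIncremental(List<String> ngrams) {
        double score = 0.0;
        for (String g : ngrams) {
            Double lp = MODEL.get(g);
            if (lp != null) {
                score += lp; // one lookup + one update per occurrence
            }
        }
        return score;
    }

    // Batched style: count occurrences first, then apply each distinct ngram once.
    static double scoreBatched(List<String> ngrams) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : ngrams) {
            counts.merge(g, 1, Integer::sum);
        }
        double score = 0.0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Double lp = MODEL.get(e.getKey());
            if (lp != null) {
                score += lp * e.getValue(); // one lookup per distinct ngram
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> ngrams = List.of("the", "the", "he ", "zzz");
        System.out.println(scoreIncremental(ngrams)); // -3.5
        System.out.println(scoreBatched(ngrams));     // -3.5
    }
}
```

Both variants produce identical scores; the difference is purely in how often the model map is consulted, which is one place a per-ngram implementation could lose time relative to a batched one, especially with twice as many language models loaded.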