[ 
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872418#comment-16872418
 ] 

Tim Allison commented on TIKA-2790:
-----------------------------------

bq.  For an apples-to-apples comparison with OpenNLP, I guess you'd have to 
load the same 103 language models that they support (or some intersection of 
the same?)

Yes, absolutely. The initial comparison was "out of the box"* ... apples to 
oranges. *With the one exception that I loaded all of Yalder's languages, 
including the extras.  I wanted to see, initially, what happens if we take the 
packages off the shelf.  I agree that a follow-on apples-to-apples comparison 
would be better. :)

bq. as yalder is slower than Optimaize & OpenNLP when early termination is 
disabled,

This has been puzzling me as well.  My _guess_ is that Yalder updates its 
statistics on every new known ngram, rather than batching counts first.  But 
there may very well be something else going on, including the 2x number of 
languages that Yalder was handling!
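To illustrate the guess above, here is a minimal, hypothetical sketch (not Yalder's actual code; the class, method names, and log-probability model are all assumptions) of the two strategies: updating a running score on every ngram occurrence vs. counting distinct ngrams first and applying each log-probability once. The results are identical; the difference is per-occurrence vs. per-distinct-ngram work.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch contrasting per-ngram score updates with batched counts.
public class NgramScoring {

    // Per-ngram: one map lookup and one add for every occurrence.
    static double scorePerNgram(String[] ngrams, Map<String, Double> logProbs) {
        double score = 0.0;
        for (String ng : ngrams) {
            Double lp = logProbs.get(ng);
            if (lp != null) {
                score += lp;
            }
        }
        return score;
    }

    // Batched: count occurrences first, then one lookup per distinct ngram.
    static double scoreBatched(String[] ngrams, Map<String, Double> logProbs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ng : ngrams) {
            counts.merge(ng, 1, Integer::sum);
        }
        double score = 0.0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Double lp = logProbs.get(e.getKey());
            if (lp != null) {
                score += lp * e.getValue();
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> logProbs = new HashMap<>();
        logProbs.put("th", -1.0);  // toy log-probabilities, illustration only
        logProbs.put("he", -1.5);
        String[] ngrams = {"th", "he", "th", "xx"};
        System.out.println(scorePerNgram(ngrams, logProbs)); // -3.5
        System.out.println(scoreBatched(ngrams, logProbs));  // -3.5
    }
}
```

With many repeated ngrams (the common case in natural text), the batched version touches the language-model map far fewer times, which is one plausible source of the speed gap.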

bq.  and even slower on short text with early termination
I'd want to do quite a bit more benchmarking on short texts before confirming 
this generally.  I worry about micro-benchmarking pitfalls.  I'm more 
comfortable with the results on the longer chunks of text.
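One hypothetical reason early termination could fail to help on short texts (again an illustration, not any detector's actual logic; the threshold and margin values are made up): a typical early-exit check only fires after a minimum number of ngrams, so short inputs pay for the check without ever reaching it.

```java
// Hypothetical early-termination check for a language detector.
public class EarlyTermination {
    static final int MIN_NGRAMS = 200;  // assumed warm-up before checking
    static final double MARGIN = 5.0;   // assumed confidence margin (log space)

    // Stop once the leader's margin over the runner-up is decisive,
    // but only after enough evidence has accumulated.
    static boolean shouldStop(int ngramsSeen, double best, double secondBest) {
        return ngramsSeen >= MIN_NGRAMS && (best - secondBest) > MARGIN;
    }

    public static void main(String[] args) {
        // A 50-ngram text never reaches the warm-up threshold,
        // so the check is pure overhead.
        System.out.println(shouldStop(50, -40.0, -90.0));    // false
        // A long text with a clear leader terminates early.
        System.out.println(shouldStop(500, -400.0, -900.0)); // true
    }
}
```

If something like this is in play, short-text runs would carry the per-iteration cost of the check while never benefiting from it, which would match the benchmark shape described above.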



> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, hasEnough.png, 
> langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, 
> langid_20190514_plus_minus_1.zip, timeVsLength.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
