[jira] [Commented] (OPENNLP-1265) Improve speed of lang detect

Tim Allison (JIRA) Fri, 07 Jun 2019 06:52:34 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858660#comment-16858660
 ]


Tim Allison commented on OPENNLP-1265:
--------------------------------------

I found that Yalder's amazing speed comes from processing only about the 
first 60 characters of the input (on average).

Based on a review of Yalder, Optimaize and OpenNLP, I see the following areas 
for improvement in speed:
 # Limit the mail_regex in the UrlCharSequenceNormalizer (OPENNLP-1266)
 # Swap out StringList for String (this halves the processing time). If the 
OpenNLP devs are on board with this, I'll open a separate issue and PR.
 # Stop short – don't process the full String (OPENNLP-1267)
 ## Don't copy the full String
 ## Don't normalize the full String
 ## Don't compute character ngrams on the full String
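
To illustrate the "stop short" idea, here is a minimal sketch, not the actual 
OpenNLP code. The class name StopShortNgrams, the method countNgrams and the 
MAX_CHARS default are all hypothetical; the value 60 just echoes the Yalder 
average mentioned above and is not a recommended setting. The sketch bounds 
the loop by a prefix length rather than copying or scanning the whole input. 
Normalization is left out; under the same idea it would be applied only to 
that prefix.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class StopShortNgrams {

    // Assumed cutoff, borrowed from the "around 60 characters" observation.
    static final int MAX_CHARS = 60;

    // Counts character n-grams of size n over at most the first maxChars
    // characters of text. The input is never copied in full; only the
    // individual n-grams are materialized as Strings.
    static Map<String, Integer> countNgrams(CharSequence text, int n, int maxChars) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Index one past the last character we are willing to look at.
        int end = Math.min(text.length(), Math.max(0, maxChars));
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= end; i++) {
            String gram = text.subSequence(i, i + n).toString();
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }
}
```

A caller would invoke something like StopShortNgrams.countNgrams(text, 3, 
StopShortNgrams.MAX_CHARS). The cost then depends on the cutoff rather than 
on the document length, which is the point of OPENNLP-1267. Whether a fixed 
cutoff hurts accuracy on short or mixed-language input is a separate question 
that this sketch does not address.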

> Improve speed of lang detect
> ----------------------------
>
>                 Key: OPENNLP-1265
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1265
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Over on TIKA-2790, we found that OpenNLP's language detector is far, far 
> slower than Optimaize and Yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
