[
https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858660#comment-16858660
]
Tim Allison commented on OPENNLP-1265:
--------------------------------------
I found that Yalder's amazing speed comes from processing only about the
first 60 characters of the input (on average).
Based on a review of Yalder, Optimaize and OpenNLP, I see the following areas
for improvement in speed:
# Limit the mail_regex in the UrlCharSequenceNormalizer (OPENNLP-1266)
# Swap out StringList for String (this halves the processing time). If
OpenNLP devs are on board with this, I'll open a separate issue and PR.
# Stop short – don't process the full String (OPENNLP-1267)
## Don't copy the full String
## Don't normalize the full String
## Don't compute character ngrams on the full String
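A minimal sketch of the "stop short" idea from item 3, for illustration only and not the actual OpenNLP change. The class name StopShortSketch, the MAX_CHARS cap of 60 (borrowed from the Yalder average above), and the lowercase-only normalization are all assumptions made for this example. It normalizes and counts character ngrams on a bounded prefix only, and keys the counts by plain String rather than StringList:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of OpenNLP.
class StopShortSketch {

    // Assumed cap, taken from the ~60 characters Yalder reads on average.
    static final int MAX_CHARS = 60;

    // Copies and normalizes only the first maxChars characters.
    // Lowercasing stands in for whatever normalizers are really configured.
    static String normalizedPrefix(CharSequence text, int maxChars) {
        int end = Math.max(0, Math.min(text.length(), maxChars));
        StringBuilder sb = new StringBuilder(end);
        for (int i = 0; i < end; i++) {
            sb.append(Character.toLowerCase(text.charAt(i)));
        }
        return sb.toString();
    }

    // Counts character ngrams of length n over the normalized prefix only.
    // Keys are plain Strings (no StringList wrapper).
    static Map<String, Integer> countNgrams(CharSequence text, int n, int maxChars) {
        String prefix = normalizedPrefix(text, maxChars);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= prefix.length(); i++) {
            counts.merge(prefix.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String longText = "The quick brown fox jumps over the lazy dog. ".repeat(1000);
        // Work is bounded by MAX_CHARS regardless of the input length.
        Map<String, Integer> trigrams = countNgrams(longText, 3, MAX_CHARS);
        System.out.println("distinct trigrams in prefix: " + trigrams.size());
    }
}
```

With this shape the cost of copying, normalizing, and ngram extraction depends on the cap rather than on the document length. Whether a fixed cap hurts accuracy on short or mixed-language input would need to be measured separately.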
> Improve speed of lang detect
> ----------------------------
>
> Key: OPENNLP-1265
> URL: https://issues.apache.org/jira/browse/OPENNLP-1265
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Over on TIKA-2790, we found that OpenNLP's language detector is far, far
> slower than Optimaize and Yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.