[https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854997#comment-16854997]
Tim Allison commented on OPENNLP-1265:
--------------------------------------
I ran some more experiments (commit d05281a9 on https://github.com/tballison/opennlp/tree/OPENNLP-1265).
I changed the input string to 100 copies of "estava em uma marcenaria na Rua Bruno http://www.cnn.com [email protected]".
These are the elapsed times for 15,000 detections:
{noformat}
DefaultLanguageDetectorContextGenerator: 60700
NGramCharContextGenerator: 28229
SlightlyFasterNGramCharContextGenerator: 19339
LuceneDetectorContextGenerator: 13802
{noformat}
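The timing loop behind those numbers is roughly of this shape (a sketch, not the exact harness in the branch; the model file name is an assumption):
{code:java}
import java.io.File;

import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class LangDetectTiming {

    public static void main(String[] args) throws Exception {
        // 100 copies of the test sentence (the original string also contained an
        // email address, redacted in the ticket).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("estava em uma marcenaria na Rua Bruno http://www.cnn.com ");
        }
        String doc = sb.toString();

        // Assumed model file name -- use whichever langdetect model you have locally.
        LanguageDetectorModel model = new LanguageDetectorModel(new File("langdetect-183.bin"));
        LanguageDetector detector = new LanguageDetectorME(model);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 15_000; i++) {
            detector.predictLanguage(doc);
        }
        System.out.println("elapsed: " + (System.currentTimeMillis() - start) + "ms");
    }
}
{code}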
The first row is the default, as-is.
The second swaps out StringList for String.
The third is the same as the second, but uses a Map<String,MutableInt> (sketched below). I wouldn't expect to see as much improvement on actual text; obviously, this test string is highly redundant. :D
The 4th uses Lucene's UAX29URLEmailTokenizer and a customized NGramFilter that prepends and appends a space to each token so that we get the start-of-token/end-of-token space sentinels (also sketched below).
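A minimal sketch of the Map<String,MutableInt> idea from the third variant (class and method names here are illustrative, not the ones in the branch): count character n-grams as plain Strings and bump a small mutable counter instead of re-boxing an Integer on every increment.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class NGramCounts {

    static final class MutableInt {
        int count = 1;           // first occurrence counts as 1
        void increment() { count++; }
    }

    public static Map<String, MutableInt> count(CharSequence text, int minGram, int maxGram) {
        Map<String, MutableInt> counts = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                String gram = text.subSequence(start, start + len).toString();
                MutableInt existing = counts.get(gram);
                if (existing == null) {
                    counts.put(gram, new MutableInt());
                } else {
                    existing.increment();
                }
            }
        }
        return counts;
    }
}
{code}
And a hedged sketch of the space-sentinel padding for the 4th variant, again with an illustrative class name; the assumption is that a filter like this sits between the UAX29URLEmailTokenizer and an n-gram filter so the grams can see token boundaries.
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SpaceSentinelFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public SpaceSentinelFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // "marcenaria" -> " marcenaria " so downstream n-grams capture token boundaries.
        String padded = " " + termAtt.toString() + " ";
        termAtt.setEmpty().append(padded);
        return true;
    }
}
{code}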
> Improve speed of lang detect
> ----------------------------
>
> Key: OPENNLP-1265
> URL: https://issues.apache.org/jira/browse/OPENNLP-1265
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Over on TIKA-2790, we found that opennlp's language detector is far, far
> slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.