[https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854997#comment-16854997]
Tim Allison commented on OPENNLP-1265:
--------------------------------------
I ran some more experiments (commit d05281a9 on https://github.com/tballison/opennlp/tree/OPENNLP-1265).
I changed the input string to 100 copies of "estava em uma marcenaria na Rua Bruno http://www.cnn.com [email protected]".
These are the elapsed times for 15,000 detections:
{noformat}
DefaultLanguageDetectorContextGenerator: 60700
NGramCharContextGenerator: 28229
SlightlyFasterNGramCharContextGenerator: 19339
LuceneDetectorContextGenerator: 13802
{noformat}
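The timing loop behind those numbers is roughly of this shape (a sketch, not the exact harness in the branch; the model file name is an assumption):
{code:java}
import java.io.File;

import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class LangDetectTiming {

    public static void main(String[] args) throws Exception {
        // 100 copies of the test sentence (the original string also contained an
        // email address, redacted in the ticket).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("estava em uma marcenaria na Rua Bruno http://www.cnn.com ");
        }
        String doc = sb.toString();

        // Assumed model file name -- use whichever langdetect model you have locally.
        LanguageDetectorModel model = new LanguageDetectorModel(new File("langdetect-183.bin"));
        LanguageDetector detector = new LanguageDetectorME(model);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 15_000; i++) {
            detector.predictLanguage(doc);
        }
        System.out.println("elapsed: " + (System.currentTimeMillis() - start) + "ms");
    }
}
{code}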
The first row is the default, as-is.
The second swaps out StringList for String.
The third is the same as the second, but uses a Map<String,MutableInt> (sketched below). I wouldn't expect to see as much improvement on actual text; obviously, this test string is highly redundant. :D
The 4th uses Lucene's UAX29URLEmailTokenizer and a customized NGramFilter that prepends and appends a space to each token so that we get the start-of-token/end-of-token space sentinels (also sketched below).
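A minimal sketch of the Map<String,MutableInt> idea from the third variant (class and method names here are illustrative, not the ones in the branch): count character n-grams as plain Strings and bump a small mutable counter instead of re-boxing an Integer on every increment.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class NGramCounts {

    static final class MutableInt {
        int count = 1;           // first occurrence counts as 1
        void increment() { count++; }
    }

    public static Map<String, MutableInt> count(CharSequence text, int minGram, int maxGram) {
        Map<String, MutableInt> counts = new HashMap<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                String gram = text.subSequence(start, start + len).toString();
                MutableInt existing = counts.get(gram);
                if (existing == null) {
                    counts.put(gram, new MutableInt());
                } else {
                    existing.increment();
                }
            }
        }
        return counts;
    }
}
{code}
And a hedged sketch of the space-sentinel padding for the 4th variant, again with an illustrative class name; the assumption is that a filter like this sits between the UAX29URLEmailTokenizer and an n-gram filter so the grams can see token boundaries.
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SpaceSentinelFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public SpaceSentinelFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // "marcenaria" -> " marcenaria " so downstream n-grams capture token boundaries.
        String padded = " " + termAtt.toString() + " ";
        termAtt.setEmpty().append(padded);
        return true;
    }
}
{code}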
> Improve speed of lang detect
> ----------------------------
>
> Key: OPENNLP-1265
> URL: https://issues.apache.org/jira/browse/OPENNLP-1265
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Over on TIKA-2790, we found that opennlp's language detector is far, far
> slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.