[ https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853462#comment-16853462 ]
Tim Allison edited comment on OPENNLP-1265 at 5/31/19 11:04 PM:
----------------------------------------------------------------
Baseline:
Input string: 10000x "estava em uma marcenaria na Rua Bruno "
model: langdetect-183.bin
runs: 4 (the first run is a warmup and its result is not shown)
Results (millis, lang)
13366 : por=50
13608 : por=50
14035 : por=50
If we switch to working with String-based ngrams instead of StringList, there's
roughly a 2.2x speedup (mean of about 13,670 ms down to about 6,145 ms):
6087 : por=50
6202 : por=50
6146 : por=50
see:
https://github.com/tballison/opennlp/blob/OPENNLP-1265/opennlp-tools/src/main/java/opennlp/tools/ngram/NGramModelSimplified.java
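
To illustrate the general idea only, here is a minimal sketch of counting character ngrams keyed by plain String rather than by a wrapper list object. It is not the code in the linked NGramModelSimplified.java; the class name CharNGramCounter, the method count, and the length range used in main are hypothetical and chosen just for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: count character ngrams using plain String keys.
public class CharNGramCounter {

    // Counts every character ngram whose length is in [minLen, maxLen].
    // Each ngram is a single substring, so no per-ngram wrapper object
    // (such as a list of one-character tokens) has to be built or hashed.
    // Note: lengths are in UTF-16 char units, not Unicode code points.
    public static Map<String, Integer> count(CharSequence text, int minLen, int maxLen) {
        String s = text.toString();
        Map<String, Integer> counts = new HashMap<>();
        for (int len = minLen; len <= maxLen; len++) {
            for (int start = 0; start + len <= s.length(); start++) {
                counts.merge(s.substring(start, start + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 1..3 length range is an assumption made for this sketch.
        Map<String, Integer> counts =
                count("estava em uma marcenaria na Rua Bruno ", 1, 3);
        System.out.println("distinct ngrams: " + counts.size());
    }
}
```

The point of the sketch is that the map key is the ngram string itself, so hashing and equality checks operate on one object per ngram.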
> Improve speed of lang detect
> ----------------------------
>
> Key: OPENNLP-1265
> URL: https://issues.apache.org/jira/browse/OPENNLP-1265
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Over on TIKA-2790, we found that OpenNLP's language detector is far, far
> slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.