[ 
https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853463#comment-16853463
 ] 

Tim Allison commented on OPENNLP-1265:
--------------------------------------

How much are the normalizers slowing things down?  We need normalization, but 
let's see if one of them is slowing things down more than others.

The baseline uses the simple string-based n-grams from above.

Let's try turning off each default normalizer, one by one:

Turn off emoji:
5442 : por=50
5605 : por=50
5528 : por=50

Turn off URL (alone; emoji turned back on):
4317 : por=50
4219 : por=50
4257 : por=50

Turn off twitter:
5746 : por=50
5737 : por=50
5803 : por=50

Turn off number:
6204 : por=50
6208 : por=50
5974 : por=50

Turn off shrink char:
5371 : por=50
5619 : por=50
5352 : por=50

Now, for kicks, let's turn off all the normalizers:
2494 : por=50
2573 : por=50
2485 : por=50

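The URL normalizer seems to have the largest effect: turning it off alone 
gives roughly 4200-4300, versus roughly 5350-6200 with any other single 
normalizer off, and about 2500 with all of them off.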
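
For anyone who wants to repeat this kind of one-at-a-time ablation, here is a 
minimal, self-contained sketch of the idea. It is not the harness used for the 
numbers above and it does not call OpenNLP: the class name NormalizerAblation, 
the regex patterns, and the sample text are all hypothetical stand-ins for the 
default normalizers, so only the relative ordering of the timings would be 
meaningful.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class NormalizerAblation {

    // Hypothetical stand-ins for the default normalizers.
    // These regexes are assumptions for illustration, NOT OpenNLP's code.
    static final Map<String, UnaryOperator<String>> NORMALIZERS = new LinkedHashMap<>();
    static {
        Pattern emoji = Pattern.compile("[\\x{1F300}-\\x{1FAFF}]");
        Pattern url = Pattern.compile("https?://\\S+");
        Pattern twitter = Pattern.compile("[@#]\\w+");
        Pattern number = Pattern.compile("\\d+");
        Pattern repeat = Pattern.compile("(.)\\1{2,}");
        NORMALIZERS.put("emoji", s -> emoji.matcher(s).replaceAll(" "));
        NORMALIZERS.put("url", s -> url.matcher(s).replaceAll(" "));
        NORMALIZERS.put("twitter", s -> twitter.matcher(s).replaceAll(" "));
        NORMALIZERS.put("number", s -> number.matcher(s).replaceAll(" "));
        // Collapse runs of 3+ identical characters down to two.
        NORMALIZERS.put("shrink", s -> repeat.matcher(s).replaceAll("$1$1"));
    }

    // Written to in the timing loop so the work is not optimized away.
    static volatile int sink;

    // Build a pipeline that applies every normalizer except `skipped`.
    // Passing null (or an unknown name) leaves all normalizers on.
    static UnaryOperator<String> pipelineWithout(String skipped) {
        return text -> {
            String out = text;
            for (Map.Entry<String, UnaryOperator<String>> e : NORMALIZERS.entrySet()) {
                if (!e.getKey().equals(skipped)) {
                    out = e.getValue().apply(out);
                }
            }
            return out;
        };
    }

    // Wall-clock time in milliseconds for `iterations` runs of the pipeline.
    static long timeMillis(UnaryOperator<String> pipeline, String text, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = pipeline.apply(text).length();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String text = "Veja https://example.com/a?b=1 @amigo #tag em 2019 simmmm!!!";
        int iterations = 200_000;
        timeMillis(pipelineWithout(null), text, iterations); // warm-up, result discarded
        System.out.println("all on : " + timeMillis(pipelineWithout(null), text, iterations));
        for (String name : NORMALIZERS.keySet()) {
            System.out.println("no " + name + " : "
                    + timeMillis(pipelineWithout(name), text, iterations));
        }
    }
}
```

A single warm-up pass and System.nanoTime() are crude; a harness such as JMH 
would give more trustworthy numbers, but this is enough to see which step 
dominates.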

> Improve speed of lang detect
> ----------------------------
>
>                 Key: OPENNLP-1265
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1265
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Over on TIKA-2790, we found that opennlp's language detector is far, far 
> slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)