[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836719#comment-16836719
]
Tim Allison commented on TIKA-2790:
-----------------------------------
In addition to speed and error rate on different lengths of text, I'd also like
to see how sensitive the confidence scores are to noisy text.
I took opennlp's subset of the Leipzig corpus and randomly selected text samples
with lengths (in chars) of 50, 100, 200, 500, 1000, 10000 and 100000.
For each text, I also randomly added noise at rates of 5%, 10%, 20%, 30%, 50%
and 90%; each noise character is a single character selected at random from
codepoints 0-1,000,000.
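A minimal sketch of the sampling and noising steps described above, not the code actually used for these runs. The names {{sample_text}}, {{add_noise}} and {{random_char}} are hypothetical. It assumes that noise replaces existing characters (the comment does not say whether noise characters are inserted or substituted) and that surrogate code points are skipped because they are not valid standalone characters.

```python
import random

MAX_CODEPOINT = 1_000_000  # upper bound of the noise range quoted above


def sample_text(text, length, rng):
    """Return a random slice of `length` chars (the whole text if it is shorter)."""
    start = rng.randint(0, max(0, len(text) - length))
    return text[start:start + length]


def random_char(rng):
    """Pick one random character from code points 0..MAX_CODEPOINT.

    Assumption: surrogates (U+D800..U+DFFF) are skipped.
    """
    while True:
        cp = rng.randint(0, MAX_CODEPOINT)
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp)


def add_noise(text, rate, rng):
    """Replace each character with a random one with probability `rate`.

    Assumption: noise substitutes characters, so the length is unchanged.
    """
    return "".join(
        random_char(rng) if rng.random() < rate else ch for ch in text
    )
```

Passing a seeded {{random.Random}} instance makes a given sample and noise level reproducible across detectors.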
I added a pseudo-language "num" that is composed solely of (Arabic) numerals,
spaces and commas.
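One way such "num" text could be generated is sketched below. The function name {{make_num_text}} is hypothetical, and drawing every character uniformly from the alphabet is an assumption; the comment does not say how the digits, spaces and commas were distributed.

```python
import random

# Alphabet of the "num" pseudo-language: digits, space and comma.
NUM_ALPHABET = "0123456789" + " ,"


def make_num_text(length, rng):
    """Build `length` chars drawn uniformly from NUM_ALPHABET (an assumption)."""
    return "".join(rng.choice(NUM_ALPHABET) for _ in range(length))
```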
I slightly modified yalder to allow simpler loading of all models (core and
extras) -- see my {{load_all_langs}} branch.
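As [~kkrugler] observed, yalder is much, much faster than opennlp on longer
text. Both average about 1.2 ms at 50 chars, but at 100000 chars yalder averages
2.92 ms vs. 126.74 ms for opennlp: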
||Detector||Length (chars)||Millis||Avg(ms)||Stdev||
|YalderDetector|50|1046|1.22|0.55|
|YalderDetector|100|1057|1.24|0.48|
|YalderDetector|200|1166|1.37|0.66|
|YalderDetector|500|1070|1.25|0.52|
|YalderDetector|1000|1123|1.31|0.53|
|YalderDetector|10000|1184|1.39|0.52|
|YalderDetector|100000|2495|2.92|3.12|
|OptimaizeLangDetector|50|1039|1.22|0.52|
|OptimaizeLangDetector|100|1054|1.23|0.5|
|OptimaizeLangDetector|200|1085|1.27|0.51|
|OptimaizeLangDetector|500|1142|1.34|0.54|
|OptimaizeLangDetector|1000|1202|1.41|0.57|
|OptimaizeLangDetector|10000|1983|2.32|0.82|
|OptimaizeLangDetector|100000|10465|12.25|9.09|
|OpenNLPLangDetector|50|1019|1.19|0.41|
|OpenNLPLangDetector|100|1193|1.4|0.51|
|OpenNLPLangDetector|200|1400|1.64|0.54|
|OpenNLPLangDetector|500|1968|2.3|0.63|
|OpenNLPLangDetector|1000|2992|3.5|1.14|
|OpenNLPLangDetector|10000|15450|18.09|12.47|
|OpenNLPLangDetector|100000|108240|126.74|52.4|
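In every row, Millis divided by Avg(ms) comes to roughly 854, which suggests that Millis is the total time over the same set of texts and Avg(ms) is the per-text mean. Under that reading, the three timing columns could be computed as sketched below. The name {{time_detector}} and the {{detect}} callable are hypothetical stand-ins, not Tika or tika-eval APIs, and using the sample standard deviation is an assumption.

```python
import statistics
import time


def time_detector(detect, texts):
    """Time `detect` on each text; return total, mean and stdev in milliseconds."""
    elapsed_ms = []
    for text in texts:
        start = time.perf_counter()
        detect(text)  # hypothetical callable wrapping a language detector
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "millis": sum(elapsed_ms),
        "avg_ms": statistics.mean(elapsed_ms),
        # Sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(elapsed_ms) if len(elapsed_ms) > 1 else 0.0,
    }
```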
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>