[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856101#comment-16856101
]
Tim Allison edited comment on TIKA-2790 at 6/4/19 8:38 PM:
-----------------------------------------------------------
In going down the path of sampling, or stopping short, I wanted to see how
much text OpenNLP needs. The question is: what is the minimum length and
minimum confidence beyond which the detector is always correct? To answer
that, I measured the inverse: the maximum confidence, and the text length at
which it occurred, when the detector incorrectly identifies a language.
In the following table, I show, for each language, the maximum wrong
confidence, the incorrectly detected language, and the text length (in
characters) at which that misidentification occurred. For example, at a text
length of 230 characters, OpenNLP had a confidence of 0.43 that the text was
'hrv', but it was really 'bos'.
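To make the measurement concrete, here is a minimal sketch of how such a per-language "maximum wrong confidence" could be computed from raw detection results. The `Detection` record, the function name, and most of the sample values are hypothetical and are not taken from tika-eval or OpenNLP; only the 0.43 / 230 'bos' vs. 'hrv' example comes from this comment.

```python
from collections import namedtuple

# Hypothetical record: one detector call on a text sample of known language.
Detection = namedtuple("Detection", "true_lang predicted_lang confidence length")


def max_wrong_by_language(detections):
    """Return {true_lang: (wrong_id, max_wrong_conf, length)}.

    Only incorrect detections are considered. For each true language we keep
    the wrong detection with the highest confidence, along with the text
    length at which it happened.
    """
    worst = {}
    for d in detections:
        if d.predicted_lang == d.true_lang:
            continue  # correct detection, not relevant here
        current = worst.get(d.true_lang)
        if current is None or d.confidence > current[1]:
            worst[d.true_lang] = (d.predicted_lang, d.confidence, d.length)
    return worst


# Illustrative data; only the first row mirrors a value reported in the table.
sample = [
    Detection("bos", "hrv", 0.43, 230),
    Detection("bos", "hrv", 0.30, 110),
    Detection("bos", "bos", 0.80, 500),
    Detection("cat", "vol", 0.01, 10),
]
result = max_wrong_by_language(sample)
```

Languages that are never misidentified simply do not appear in the result, which is one possible reason a language might be absent from a table like the one below.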
As the original confusion matrix shows, some language pairs are much harder
and require more evidence, e.g. {{ekk}} and {{est}}, {{fas}} and {{pes}},
{{hrv}} and {{bos}}, {{ind}} and {{sun}}, {{pus}} and {{por}}, but many
languages require very little text.
||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|ast|nno|0.03|90|
|bak|tat|0.08|70|
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|ces|slk|0.01|10|
|cym|min|0.01|10|
|dan|war|0.02|30|
|deu|war|0.01|10|
|ekk|est|0.54|310|
|eng|nan|0.01|10|
|est|ekk|0.56|250|
|fas|pes|0.52|550|
|fin|min|0.01|10|
|fra|fin|0.01|10|
|gsw|lat|0.01|10|
|hrv|bos|0.64|1010|
|hun|nob|0.01|10|
|ind|sun|0.73|810|
|isl|fao|0.02|30|
|ita|fra|0.04|110|
|jav|afr|0.02|50|
|lav|lvs|0.35|170|
|lim|epo|0.02|30|
|ltz|vol|0.01|10|
|lvs|lav|0.03|30|
|mlt|eng|0.02|50|
|msa|ind|0.45|490|
|nan|tur|0.01|10|
|nds|plt|0.01|10|
|nep|san|0.01|10|
|nld|plt|0.01|10|
|nno|nob|0.12|130|
|nob|nno|0.62|290|
|oci|ita|0.01|10|
|pes|fas|0.57|730|
|pus|por|0.27|130|
|ron|lat|0.04|70|
|rus|mkd|0.02|10|
|slk|epo|0.01|10|
|slv|min|0.01|10|
|spa|vol|0.01|10|
|sqi|zul|0.01|30|
|sun|ind|0.60|790|
|swe|dan|0.02|30|
|tat|bak|0.03|30|
|tgl|ceb|0.01|10|
|tur|min|0.01|10|
|ukr|che|0.02|10|
|uzb|kir|0.02|10|
|vie|war|0.02|30|
|zul|swa|0.02|10|
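As a rough way to read the table, the sketch below parses rows in the same Jira table markup and lists the pairs whose maximum wrong confidence reaches a cutoff. The function names and the 0.25 cutoff are assumptions chosen for illustration, not values proposed in this issue; the rows are an excerpt of the table above.

```python
def parse_jira_rows(text):
    """Parse '|a|b|0.1|10|' style rows, skipping the '||...||' header line."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("||"):
            continue
        lang, wrong_id, conf, length = line.strip("|").split("|")
        rows.append((lang, wrong_id, float(conf), int(length)))
    return rows


def hard_pairs(rows, min_conf=0.25):
    """Return (lang, wrong_id) pairs whose max wrong confidence >= min_conf.

    min_conf is a hypothetical cutoff for calling a pair 'hard'.
    """
    return [(lang, wrong) for lang, wrong, conf, _ in rows if conf >= min_conf]


# Excerpt of the table above.
TABLE_EXCERPT = """
||Lang||WrongId||MaxWrongConf||MaxWrongLength||
|bos|hrv|0.43|230|
|cat|vol|0.01|10|
|hrv|bos|0.64|1010|
|ind|sun|0.73|810|
|pus|por|0.27|130|
"""

rows = parse_jira_rows(TABLE_EXCERPT)
pairs = hard_pairs(rows)
```

With this excerpt, `pairs` contains the 'bos', 'hrv', 'ind' and 'pus' rows and leaves out 'cat', whose only recorded error is at 10 characters with 0.01 confidence.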
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: fra_mixed_100000_0.0_0.txt, langid_20190509.zip,
> langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip
>
>