[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839443#comment-16839443 ]
Tim Allison commented on TIKA-2790:
-----------------------------------
[~joern], these are the top three predictions from OpenNLP's langdetector on
"fra_mixed_100000_0.0_0.txt" at different input lengths. Note that this
file has no noise in it, yet the top prediction flips from fra to che at
length 90000:
||Length||Lang1||Conf1||Lang2||Conf2||Lang3||Conf3||
|10|fra|.0130|min|.0129|spa|.0128|
|20|fra|.0184|nan|.0161|vol|.0141|
|50|fra|.0353|oci|.0226|jav|.0189|
|100|fra|.1046|oci|.0354|cat|.0338|
|200|fra|.5813|oci|.0351|cat|.0276|
|500|fra|.9867|oci|.0037|cat|.0025|
|1000|fra|.9999|oci|.0000|ltz|.0000|
|10000|fra|1.0000|che|.0000|eng|.0000|
|20000|fra|1.0000|che|.0000|tat|.0000|
|30000|fra|1.0000|che|.0000|tat|.0000|
|40000|fra|1.0000|che|.0000|tat|.0000|
|50000|fra|1.0000|che|.0000|tat|.0000|
|60000|fra|1.0000|che|.0000|tat|.0000|
|70000|fra|.9990|che|.0010|tat|.0000|
|80000|fra|.8770|che|.1228|tat|.0002|
|90000|che|.6154|fra|.3838|tat|.0008|
|100000|che|.9815|fra|.0168|tat|.0016|
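
To make the pattern in the table easier to read off, here is a minimal sketch that parses the rows above and reports where the top prediction first becomes confident and where it stops being fra. It is not part of tika-eval or OpenNLP; the helper names (parse_rows, margin, first_length_reaching, first_length_where_top_differs) and the 0.9 threshold are hypothetical and chosen only for illustration. The rows are copied verbatim from the table.

```python
# Rows copied verbatim from the table above (JIRA wiki-markup, header omitted).
ROWS = """\
|10|fra|.0130|min|.0129|spa|.0128|
|20|fra|.0184|nan|.0161|vol|.0141|
|50|fra|.0353|oci|.0226|jav|.0189|
|100|fra|.1046|oci|.0354|cat|.0338|
|200|fra|.5813|oci|.0351|cat|.0276|
|500|fra|.9867|oci|.0037|cat|.0025|
|1000|fra|.9999|oci|.0000|ltz|.0000|
|10000|fra|1.0000|che|.0000|eng|.0000|
|20000|fra|1.0000|che|.0000|tat|.0000|
|30000|fra|1.0000|che|.0000|tat|.0000|
|40000|fra|1.0000|che|.0000|tat|.0000|
|50000|fra|1.0000|che|.0000|tat|.0000|
|60000|fra|1.0000|che|.0000|tat|.0000|
|70000|fra|.9990|che|.0010|tat|.0000|
|80000|fra|.8770|che|.1228|tat|.0002|
|90000|che|.6154|fra|.3838|tat|.0008|
|100000|che|.9815|fra|.0168|tat|.0016|
"""


def parse_rows(text):
    """Return a list of (length, [(lang, confidence), ...]) tuples."""
    rows = []
    for line in text.strip().splitlines():
        cells = line.strip().strip("|").split("|")
        length = int(cells[0])
        # Remaining cells alternate: language code, confidence.
        candidates = [
            (cells[i], float(cells[i + 1])) for i in range(1, len(cells), 2)
        ]
        rows.append((length, candidates))
    return rows


def margin(candidates):
    """Confidence gap between the first and second candidates."""
    return candidates[0][1] - candidates[1][1]


def first_length_reaching(rows, threshold):
    """Smallest length whose top confidence is at least `threshold`."""
    for length, candidates in rows:
        if candidates[0][1] >= threshold:
            return length
    return None


def first_length_where_top_differs(rows, expected_lang):
    """Smallest length whose top prediction is not `expected_lang`."""
    for length, candidates in rows:
        if candidates[0][0] != expected_lang:
            return length
    return None


if __name__ == "__main__":
    rows = parse_rows(ROWS)
    print(first_length_reaching(rows, 0.9))             # 500
    print(first_length_where_top_differs(rows, "fra"))  # 90000
    for length, candidates in rows:
        print(length, candidates[0][0], round(margin(candidates), 4))
```

Read this way, the table has two distinct regimes: below length 500 the top confidence for fra is under 0.9 and the margin over the runner-up is small, and from length 90000 onward the top prediction is che rather than fra.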
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Attachments: langid_20190509.zip, langid_20190510.zip,
> langid_20190514.zip
>
>