[
https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343
]
Ken Krugler commented on TIKA-2790:
-----------------------------------
My concern with OpenNLP is that during a web crawl, even with the current
lightweight detection algorithm, the detection can add a lot of processing
time. OpenNLP is generally not known as being "lightweight" :) But we could
give it a try, for sure.
Note that OpenNLP uses ISO 639-2 (three-letter codes). Having a more robust
representation of languages in the language detector API would be a good thing
in general (e.g. 639-2 code plus an optional locale code, so you can
differentiate Mandarin Chinese in Taiwan from Mandarin Chinese in China or
Singapore).
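
As a rough illustration of that idea, here is a minimal sketch of a language representation that pairs a three-letter code with an optional region. The class name `DetectedLanguage` and its methods are hypothetical; they are not part of Tika's or OpenNLP's API, and the two-letter region format is just an assumption for the example.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical value type: a three-letter language code plus an optional region. */
final class DetectedLanguage {
    private final String languageCode; // three-letter code, e.g. "zho"
    private final String region;       // assumed two-letter region, e.g. "TW"; null if unknown

    DetectedLanguage(String languageCode, String region) {
        Objects.requireNonNull(languageCode, "languageCode");
        if (!languageCode.matches("[a-z]{3}")) {
            throw new IllegalArgumentException("expected a three-letter code: " + languageCode);
        }
        if (region != null && !region.matches("[A-Z]{2}")) {
            throw new IllegalArgumentException("expected a two-letter region: " + region);
        }
        this.languageCode = languageCode;
        this.region = region;
    }

    String getLanguageCode() {
        return languageCode;
    }

    Optional<String> getRegion() {
        return Optional.ofNullable(region);
    }

    /** True if both values name the same language, ignoring the region. */
    boolean sameLanguage(DetectedLanguage other) {
        return languageCode.equals(other.languageCode);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DetectedLanguage)) {
            return false;
        }
        DetectedLanguage other = (DetectedLanguage) o;
        return languageCode.equals(other.languageCode) && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(languageCode, region);
    }

    @Override
    public String toString() {
        return region == null ? languageCode : languageCode + "-" + region;
    }
}
```

With something like this, `new DetectedLanguage("zho", "TW")` and `new DetectedLanguage("zho", "CN")` would compare as the same language via `sameLanguage` while remaining distinct values, and a detector that cannot determine the region could simply pass `null`.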
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>