[ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707320#comment-16707320 ]
Tim Allison edited comment on TIKA-2790 at 12/3/18 3:06 PM:
------------------------------------------------------------
We're currently using optimaize in tika-eval. OpenNLP appears to have better
language coverage, and seems to be a healthier, more active project.
So, no, not really...once I fixed the regex problem (TIKA-2777). :D But more
coverage might be nice.
The other item is that I'd like to update our "common words" counts, and I
notice that I can easily check out a large chunk of the Leipzig corpus data
from OpenNLP:
https://svn.apache.org/repos/bigdata/opennlp/trunk
So, rather than having to download wikis myself, one by one, I can
download a bunch of data easily, and that data would align with the language
detection.
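For illustration only, here is a minimal sketch of how "common words" counts
might be recomputed from such a checkout. It is not tika-eval's actual code.
The class name CommonWords and the method topWords are hypothetical, and the
input format (one sentence per line, optionally prefixed by an id and a tab)
is an assumption about the Leipzig files rather than something confirmed
here. The tokenization is deliberately naive and would not suit languages
written without spaces between words.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of tika-eval or OpenNLP.
class CommonWords {

    // Returns the n most frequent lowercased tokens, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    // Assumption: each line is "<id>\t<sentence>"; a line without a tab
    // is treated as a bare sentence.
    static Map<String, Long> topWords(List<String> lines, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            String sentence = tab >= 0 ? line.substring(tab + 1) : line;
            // Naive tokenization: split on runs of non-letter characters.
            for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        Map<String, Long> top = new LinkedHashMap<>();
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .forEachOrdered(e -> top.put(e.getKey(), e.getValue()));
        return top;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: CommonWords <sentences-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // Print the 20 most common words as "word<TAB>count".
        topWords(lines, 20).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Run per language file, this would yield one ranked list per language, which
is roughly the shape a "common words" resource needs.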
What's your recommendation?
> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
> Key: TIKA-2790
> URL: https://issues.apache.org/jira/browse/TIKA-2790
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
>