[ https://issues.apache.org/jira/browse/TIKA-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063659#comment-18063659 ]
ASF GitHub Bot commented on TIKA-4685:
--------------------------------------
tballison merged PR #2677:
URL: https://github.com/apache/tika/pull/2677
> Add a new charset detector for 4.x
> ----------------------------------
>
> Key: TIKA-4685
> URL: https://issues.apache.org/jira/browse/TIKA-4685
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> While I was building out the maxent model for the updated language detector,
> I realized we had the resources (language files by language) and a maxent
> model just sitting around and ready to build a new charset detector based on
> byte ngrams.
> I have a working prototype that appears to be quite good, and it could
> replace both the universal and icu4j detectors. There's a chance that the
> results are hallucinated or that something surprising is going on, but I
> think we should merge this and see what happens on our regression set.