[jira] [Commented] (TIKA-4685) Add a new charset detector for 4.x

ASF GitHub Bot (Jira) Fri, 06 Mar 2026 14:23:07 -0800


    [ https://issues.apache.org/jira/browse/TIKA-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063659#comment-18063659 ]


ASF GitHub Bot commented on TIKA-4685:
--------------------------------------

tballison merged PR #2677:
URL: https://github.com/apache/tika/pull/2677




> Add a new charset detector for 4.x
> ----------------------------------
>
>                 Key: TIKA-4685
>                 URL: https://issues.apache.org/jira/browse/TIKA-4685
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> While building out the maxent model for the updated language detector, 
> I realized we already had the resources (language files by language) and a 
> maxent model on hand, ready to be used to build a new charset detector 
> based on byte ngrams.
> I have something working that appears to be quite good. We can replace both 
> the universal and icu4j detectors. There's a chance that the results are 
> hallucinated or that there's something surprising going on, but I think we 
> should merge this and see what happens on our regression set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
