Tim Allison created TIKA-4685:
---------------------------------
Summary: Add a new charset detector for 4.x
Key: TIKA-4685
URL: https://issues.apache.org/jira/browse/TIKA-4685
Project: Tika
Issue Type: Task
Reporter: Tim Allison
While building out the maxent model for the updated language detector, I
realized we already had the resources (language files, organized by language)
and a maxent model sitting around, ready to be used to build a new charset
detector based on byte ngrams.
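For readers unfamiliar with the approach, here is a minimal sketch of what scoring byte ngrams with a maxent-style (softmax over linear scores) model can look like. It is not the code proposed in this issue: the class name, the choice of bigrams, and the two hand-set weights are all assumptions made for illustration, and a real model would learn its weights from the per-language files.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: scores byte bigrams with a linear, maxent-style model. */
final class ByteNgramCharsetSketch {

    /** One feature per possible byte bigram (256 * 256). */
    static final int NUM_FEATURES = 1 << 16;

    /** Counts overlapping byte bigrams; index = (first byte << 8) | second byte. */
    static float[] bigramCounts(byte[] data) {
        float[] counts = new float[NUM_FEATURES];
        for (int i = 0; i + 1 < data.length; i++) {
            int index = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            counts[index] += 1f;
        }
        return counts;
    }

    /** Softmax over per-class scores: p(c | x) is proportional to exp(bias[c] + w[c] . x). */
    static double[] probabilities(float[][] weights, float[] bias, float[] features) {
        double[] scores = new double[weights.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < weights.length; c++) {
            double s = bias[c];
            for (int i = 0; i < features.length; i++) {
                if (features[i] != 0f) {
                    s += weights[c][i] * features[i];
                }
            }
            scores[c] = s;
            max = Math.max(max, s);
        }
        // Subtract the maximum before exponentiating for numerical stability.
        double sum = 0;
        for (int c = 0; c < scores.length; c++) {
            scores[c] = Math.exp(scores[c] - max);
            sum += scores[c];
        }
        for (int c = 0; c < scores.length; c++) {
            scores[c] /= sum;
        }
        return scores;
    }

    /** Index of the most probable class. */
    static int argmax(double[] p) {
        int best = 0;
        for (int c = 1; c < p.length; c++) {
            if (p[c] > p[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"UTF-8", "ISO-8859-1"};
        float[][] weights = new float[labels.length][NUM_FEATURES];
        float[] bias = new float[labels.length];
        // Hand-set toy weights, NOT trained: they only exist to exercise the scoring path.
        weights[0][0xC3A9] = 2f; // e-acute as the UTF-8 byte pair C3 A9
        weights[1][0xE920] = 2f; // e-acute as the single byte E9 followed by a space

        byte[] sample = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        double[] p = probabilities(weights, bias, bigramCounts(sample));
        // With these toy weights the ISO-8859-1 class wins for this sample.
        System.out.println(labels[argmax(p)] + " p=" + p[argmax(p)]);
    }
}
```

The sketch uses a dense 65,536-entry feature array per class purely for simplicity; a practical detector would likely use sparse or hashed features and more than one ngram length.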
I have something working that appears to be quite good. We can replace both
the universal and icu4j charset detectors with it. There's a chance that the
results are hallucinated or that there's something surprising going on, but I
think we should merge this and see what happens on our regression set.