[ https://issues.apache.org/jira/browse/TIKA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648229#comment-17648229 ]
Tim Allison commented on TIKA-3213:
-----------------------------------
Reports are here:
[https://corpora.tika.apache.org/base/reports/tika-2.6.x-SNAPSHOT-juniversal-chardet.tgz]
My general takeaway is that we're getting more "common words" and we aren't
losing any. I did notice quite a few cases where we now detect shift_jis even
though that is not the actual encoding, but because most (all?) of the words in
those files are in the ASCII range, we don't see any problems with extraction
quality.
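To illustrate why a wrong shift_jis detection can be harmless for text in the ASCII range, here is a minimal sketch. It is not part of Tika or juniversalchardet. The class name AsciiRangeDecodeDemo and the method decodesSameAsAscii are hypothetical, and the sample sticks to letters, digits and spaces. It assumes the JDK in use ships the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika or juniversalchardet.
class AsciiRangeDecodeDemo {

    // Encodes the text as US-ASCII, decodes the same bytes with the named
    // charset, and reports whether the result matches the US-ASCII decoding.
    static boolean decodesSameAsAscii(String text, String charsetName) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        String asAscii = new String(bytes, StandardCharsets.US_ASCII);
        String asOther = new String(bytes, Charset.forName(charsetName));
        return asAscii.equals(asOther);
    }

    public static void main(String[] args) {
        // Sample limited to letters, digits and spaces (an assumption of this sketch).
        String sample = "common words in the ascii range 0123";
        System.out.println("Shift_JIS matches: " + decodesSameAsAscii(sample, "Shift_JIS"));
        System.out.println("UTF-8 matches: " + decodesSameAsAscii(sample, "UTF-8"));
    }
}
```

If the two decodings match for a file's content, the extracted words are the same no matter which of the two charsets was reported, which is consistent with the "common words" counts not dropping.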
There's an unrelated issue that makes analysis annoying. When we broadened
detection of SVG to include files that don't have the XML declaration header,
we started funneling those files to the XML parser, and we're now getting a
bunch of exceptions that we didn't get before, when we were treating those SVG
files as text.
I'll continue to look at the results in a bit more detail.
> Consider migrating universalcharsetdetector to a live fork
> ----------------------------------------------------------
>
> Key: TIKA-3213
> URL: https://issues.apache.org/jira/browse/TIKA-3213
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I just came across this living fork of the aged juniversalchardet (2011!!!):
> https://github.com/albfernandez/juniversalchardet
> It has a Mozilla license, has a decent star count and is published on Maven
> Central.
> Obviously, we'll want to run a comparison on our corpus before making this
> change, but I wanted to open this issue for discussion.