[
https://issues.apache.org/jira/browse/TIKA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648700#comment-17648700
]
Tim Allison edited comment on TIKA-3213 at 12/16/22 3:48 PM:
-------------------------------------------------------------
I looked more carefully at a few files with regressions, such as:
5CVMZWUHDV3BAJCN6TKXWSWRPRAE3LMI. This was correctly identified as {{EUC-KR}} by the
earlier version, but incorrectly identified as {{ISO-8859-15}} by the fork.
The fork does claim to identify EUC-KR. I tried stripping out the HTML tags
and got the same results. Chrome was not able to detect the encoding
correctly. Our ICU4J detector does correctly identify EUC-KR. In short, this
is a regression on a file with faulty HTML: its meta tag declares an empty
charset (<meta http-equiv="Content-Type" content="text/html; charset=">).
Overall, I think the improvements outweigh the regressions, and it is better
to depend on a project that still appears to be alive.
For high accuracy (and to defeat incorrect meta-http headers), we could get
around to integrating tika-eval into the detector. This would effectively
parse the files with each of the detected charsets and pick the one with the
best out-of-vocabulary statistic. This is for another day...
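
To make the idea concrete, here is a rough sketch of what "pick the charset with the best out-of-vocabulary statistic" could look like. It is not tika-eval's implementation: the class name {{OovCharsetPicker}}, the methods {{oovRate}} and {{pickCharset}}, the tokenization, and the in-memory vocabulary are all hypothetical, and it decodes raw bytes with the JDK rather than running a full parse.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Tika or tika-eval.
class OovCharsetPicker {

    // Fraction of letter-only tokens that are missing from the vocabulary.
    // Returns 1.0 when the text has no tokens, so empty output never wins.
    static double oovRate(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 1.0 : (double) unknown / total;
    }

    // Decodes the bytes with every candidate charset and returns the one whose
    // decoded text has the lowest out-of-vocabulary rate. On a tie the earlier
    // candidate wins. Returns null if the candidate list is empty.
    static Charset pickCharset(byte[] raw, List<Charset> candidates, Set<String> vocabulary) {
        Charset best = null;
        double bestRate = Double.MAX_VALUE;
        for (Charset candidate : candidates) {
            // new String(byte[], Charset) replaces malformed input with U+FFFD
            String decoded = new String(raw, candidate);
            double rate = oovRate(decoded, vocabulary);
            if (rate < bestRate) {
                bestRate = rate;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed toy vocabulary and input; a real vocabulary would be per-language.
        Set<String> vocabulary = Set.of("caf\u00e9", "cr\u00e8me");
        byte[] raw = "caf\u00e9 cr\u00e8me".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates = List.of(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(pickCharset(raw, candidates, vocabulary));
    }
}
```

In the detector, the candidate list would come from the existing charset detectors, so a wrong or empty meta charset would just be one more candidate that loses on the statistic.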
> Consider migrating universalcharsetdetector to a live fork
> ----------------------------------------------------------
>
> Key: TIKA-3213
> URL: https://issues.apache.org/jira/browse/TIKA-3213
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.6.1
>
> Attachments: content_diffs_no_exceptions-no-svg.xlsx
>
>
> I just came across this living fork of the aged juniversalchardet (2011!!!):
> https://github.com/albfernandez/juniversalchardet
> It has a Mozilla license, has a decent star count, and is published on Maven
> Central.
> Obviously, we'll want to run a comparison on our corpus before making this
> change, but I wanted to open this issue for discussion.