[jira] [Comment Edited] (TIKA-3213) Consider migrating universalcharsetdetector to a live fork

Tim Allison (Jira) Fri, 16 Dec 2022 07:49:06 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648700#comment-17648700 ]


Tim Allison edited comment on TIKA-3213 at 12/16/22 3:48 PM:
-------------------------------------------------------------

I looked more carefully at a few files with regressions, such as 
5CVMZWUHDV3BAJCN6TKXWSWRPRAE3LMI.  This was correctly identified as {{EUC-KR}} 
by the earlier version, but incorrectly identified as {{ISO-8859-15}} by the 
fork, even though the fork does claim to identify EUC-KR.  I tried stripping 
out the HTML tags and got the same result.  Chrome was not able to detect the 
encoding correctly either.  Our ICU4J detector does correctly identify EUC-KR.  
In short, this is a regression on a file with faulty HTML, where the charset 
value is empty (<meta http-equiv="Content-Type" content="text/html; charset=">).

Overall, I think the improvements outweigh the regressions, and it is better 
to depend on a project that still appears to be alive.

For high accuracy (and to defeat incorrect meta-http headers), we could 
eventually integrate tika-eval into the detector.  This would effectively 
parse the files with each detected charset and pick the one with the best 
out-of-vocabulary statistic.  This is for another day...
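
A rough sketch of that idea, using only the JDK: decode the raw bytes with 
each candidate charset, compute the share of tokens missing from a known 
vocabulary, and keep the charset with the lowest share.  This is not the 
tika-eval API; the class {{OovCharsetPicker}}, the methods {{oovRate}} and 
{{pickBest}}, the whitespace tokenization, and the two-word vocabulary are all 
hypothetical and only illustrate the selection step.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Tika or tika-eval.
public class OovCharsetPicker {

    // Fraction of whitespace-separated tokens that are not in the vocabulary.
    // Text with no tokens is treated as fully out of vocabulary (1.0).
    static double oovRate(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 1.0 : (double) unknown / total;
    }

    // Decode the bytes with every candidate and return the charset whose
    // decoded text has the lowest out-of-vocabulary rate.
    // On a tie, the earlier candidate in the list wins.
    static Charset pickBest(byte[] raw, List<Charset> candidates, Set<String> vocabulary) {
        Charset best = null;
        double bestRate = Double.MAX_VALUE;
        for (Charset candidate : candidates) {
            double rate = oovRate(new String(raw, candidate), vocabulary);
            if (rate < bestRate) {
                bestRate = rate;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        Charset latin9 = Charset.forName("ISO-8859-15");
        // Toy vocabulary: "annyeonghaseyo" and "segye", written as Unicode escapes
        // so the source compiles regardless of the file encoding.
        String hello = "\uC548\uB155\uD558\uC138\uC694";
        String world = "\uC138\uACC4";
        Set<String> vocabulary = Set.of(hello, world);
        byte[] raw = (hello + " " + world).getBytes(eucKr);
        System.out.println(pickBest(raw, List.of(latin9, eucKr), vocabulary));
    }
}
```

A real integration would need per-language vocabularies and a proper 
tokenizer, and decoding every file once per candidate charset has an obvious 
cost, which is part of why this is for another day.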


was (Author: [email protected]):
I looked more carefully at a few files with regressions, such as: 
5CVMZWUHDV3BAJCN6TKXWSWRPRAE3LMI.  This was correctly id'd as {{EUC-KR}} by the 
earlier version, but incorrectly identified as \{{ISO-8859-15}} by the fork.  
The fork does claim to identify EUC-KR.  I tried stripping out the html tags 
and got the same results.  Chrome was not able to detect the encoding 
correctly.  Our ICU4J detector does correctly identify EUC-KR.  In short, this 
is a regression, but the improvements outweigh the regressions.

For high accuracy (and to defeat incorrect meta-http headers), we could get 
around to integrating tika-eval into the detector.  This would effectively 
parse the files with detected charsets and pick the one with the best out of 
vocabulary statistic.  This is for another day...

> Consider migrating universalcharsetdetector to a live fork
> ----------------------------------------------------------
>
>                 Key: TIKA-3213
>                 URL: https://issues.apache.org/jira/browse/TIKA-3213
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.6.1
>
>         Attachments: content_diffs_no_exceptions-no-svg.xlsx
>
>
> I just came across this living fork of the aged juniversalchardet (2011!!!): 
> https://github.com/albfernandez/juniversalchardet
> It has a Mozilla license, has a decent star count and is published on Maven 
> Central.
> Obviously, we'll want to run a comparison on our corpus before making this 
> change, but I wanted to open this issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
