[jira] [Commented] (TIKA-3516) Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector

Jira Sun, 22 May 2022 06:24:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540594#comment-17540594
 ]


Luís Filipe Nassif commented on TIKA-3516:
------------------------------------------

Hi [~tallison],

I also got some HTML files and EML HTML bodies wrongly detected as IBM420 or 
IBM424 after upgrading to Tika 2.4.0. I tried the configuration above to exclude 
those charsets from detection, but then parsing fails with "TikaException: 
Failed to detect the character encoding of a document", thrown at 
AutoDetectReader line 132. If I comment out the "params" section in that 
configuration, the parsing exception goes away, but of course the encoding is 
still wrongly detected. Any idea?
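
For illustration only, here is the kind of calling-side fallback I have in mind: 
walk the candidate charset names in order, skip the excluded ones and any the 
JVM cannot load (such as IBM424_rtl), and fall back to a default instead of 
failing. This is a minimal sketch using only java.nio.charset; the class 
CharsetFallback and the method firstUsable are hypothetical names, not part of 
Tika, and the exclusion set is matched by exact name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;

final class CharsetFallback {

    // Hypothetical helper (not a Tika API): returns the first candidate that is
    // neither in the exclusion set nor unsupported by this JVM, else the fallback.
    static Charset firstUsable(List<String> candidates, Set<String> excluded, Charset fallback) {
        for (String name : candidates) {
            if (excluded.contains(name)) {
                continue; // explicitly excluded, exact-name match only
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: treat it as unusable and move on.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = firstUsable(
                List.of("IBM424_rtl", "UTF-8"),   // assumed detector output, best guess first
                Set.of("IBM420", "IBM424"),
                StandardCharsets.ISO_8859_1);
        System.out.println(cs); // prints UTF-8
    }
}
```

With a fallback like this, excluding every candidate yields the default charset 
rather than an exception; whether that is acceptable depends on the documents.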

Another question: which text charset detector takes precedence by default, 
Icu4jEncodingDetector or UniversalEncodingDetector? I would like to configure 
the same order used by 1.x for now.
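
To show why the order matters to me, here is a minimal sketch of a detector 
chain in which the first detector that returns a non-null charset wins. It 
assumes that "first non-null answer wins" is how the detectors are combined; 
the class DetectorOrder and its stub detectors are hypothetical stand-ins, not 
Tika classes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

final class DetectorOrder {

    // Hypothetical stub detectors: each maps bytes to a charset, or null if unsure.
    static final Function<byte[], Charset> ALWAYS_LATIN1 = data -> StandardCharsets.ISO_8859_1;
    static final Function<byte[], Charset> ALWAYS_UTF8 = data -> StandardCharsets.UTF_8;
    static final Function<byte[], Charset> NEVER_SURE = data -> null;

    // Assumed combination rule: consult detectors in list order, first non-null wins.
    static Charset detect(List<Function<byte[], Charset>> detectors, byte[] data) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset guess = detector.apply(data);
            if (guess != null) {
                return guess;
            }
        }
        return null; // no detector produced an answer
    }

    public static void main(String[] args) {
        byte[] data = new byte[0];
        // Same detectors, different order, different result.
        System.out.println(detect(List.of(ALWAYS_LATIN1, ALWAYS_UTF8), data));
        System.out.println(detect(List.of(ALWAYS_UTF8, ALWAYS_LATIN1), data));
    }
}
```

Under that assumption, swapping two detectors that disagree changes the result, 
which is why I want to reproduce the 1.x order.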

> Unexpected charset IBM424_rtl detected for  utf_8  file by CharsetDetector
> --------------------------------------------------------------------------
>
>                 Key: TIKA-3516
>                 URL: https://issues.apache.org/jira/browse/TIKA-3516
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>            Reporter: Chaitra Rajappa
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.1.0
>
>
> Hi,
>  The CharsetDetector detects the wrong charset for a file as IBM424_rtl, 
>  resulting in the exception 
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at 
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket with the same issue that has not been 
> fixed.
> https://issues.apache.org/jira/browse/TIKA-2396
>  Please suggest the changes to fix this. 
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
