[
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540594#comment-17540594
]
Luís Filipe Nassif commented on TIKA-3516:
------------------------------------------
Hi [~tallison],
I also got some HTML files and HTML bodies of EML messages being wrongly
detected as IBM420 or IBM424 after upgrading to Tika 2.4.0. I tried the
configuration above to exclude those charsets from detection, but then
parsing fails with "TikaException: Failed to detect the character encoding
of a document", thrown by AutoDetectReader line 132. If I comment out the
"params" section in the configuration above, the parsing exception goes
away, but of course the encoding is still wrongly detected. Any idea?
Another question: which txt charset detector takes precedence by default,
Icu4jEncodingDetector or UniversalEncodingDetector? I would like to configure
the same order used by 1.x for now.
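
For illustration only, the intent of excluding charsets can be sketched
without Tika: given candidate names ordered best-first, take the first one
that is not on an exclusion list. The class CharsetExclusionSketch and the
method firstAllowed are hypothetical names, not Tika API, and the sketch
does not claim to reproduce how Tika applies its configuration. It matches
names exactly, ignoring case.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

public class CharsetExclusionSketch {

    // Hypothetical helper, not part of Tika. Candidates are assumed to be
    // ordered best-first; excluded names must be given in upper case.
    static Optional<String> firstAllowed(List<String> candidates,
                                         Set<String> excludedUpperCase) {
        return candidates.stream()
                .filter(name -> !excludedUpperCase.contains(
                        name.toUpperCase(Locale.ROOT)))
                .findFirst();
    }

    public static void main(String[] args) {
        Set<String> excluded = Set.of("IBM420", "IBM424");

        // An excluded top candidate is skipped in favour of the next one.
        System.out.println(
                firstAllowed(List.of("IBM424", "UTF-8", "ISO-8859-1"), excluded));
        // prints Optional[UTF-8]

        // If every candidate is excluded, nothing is left to choose from.
        System.out.println(
                firstAllowed(List.of("IBM420", "IBM424"), excluded));
        // prints Optional.empty
    }
}
```

The empty result in the second call is the case a caller has to handle
explicitly, for example by falling back to a default charset.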
> Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector
> --------------------------------------------------------------------------
>
> Key: TIKA-3516
> URL: https://issues.apache.org/jira/browse/TIKA-3516
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Reporter: Chaitra Rajappa
> Assignee: Tim Allison
> Priority: Major
> Fix For: 2.1.0
>
>
> Hi,
> The CharsetDetector detects the wrong charset, IBM424_rtl, for a file,
> resulting in the exception
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket for the same issue that has not been
> fixed.
> https://issues.apache.org/jira/browse/TIKA-2396
> Please suggest the changes to fix this.
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks
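
As a caller-side illustration of the exception quoted above, a detected name
can be checked with Charset.isSupported before it is passed to
Charset.forName. This is only a sketch under stated assumptions: the class
SafeCharsetLookup and the method forNameOrDefault are hypothetical, the
fallback charset is the caller's choice, and it works around the symptom
rather than fixing the wrong detection.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class SafeCharsetLookup {

    // Hypothetical helper: returns the named charset if this JVM supports
    // it, otherwise the given fallback, instead of letting
    // UnsupportedCharsetException propagate.
    static Charset forNameOrDefault(String name, Charset fallback) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // "IBM424_rtl" is the name from the report; the JVM in the report
        // rejected it, so the fallback is expected here.
        System.out.println(
                forNameOrDefault("IBM424_rtl", StandardCharsets.UTF_8));
        System.out.println(
                forNameOrDefault("ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```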