[
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400426#comment-17400426
]
Tim Allison commented on TIKA-3516:
-----------------------------------
I took some time to dig into this. There are two problems.
1) The Icu4jEncodingDetector's CharsetMatch actually has code to strip off the
*_rtl and *_ltr suffixes in its #getString(). In other words, the developers of
the Icu4jEncodingDetector knew that you couldn't plop the CharsetMatch.getName()
directly into Charset.forName()...the user had to strip off the _rtl or _ltr
from some charsets. I added a "getNormalizedName()" that takes on the burden of
stripping those markers.
So, that's all for the good, however...
2) Now, on the test file from TIKA-2396, the charset is resolved to IBM424, and
that yields complete junk for that test file. So, I added the ability to ignore
specific charsets. I experimented with a confidence threshold, but the
confidence value wasn't very informative. Unless there's a better alternative,
users for now can turn off detection of IBM424 with:
{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HTMLEncodingDetector"/>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
      <params>
        <param name="ignoreCharsets" type="list">
          <string>IBM424</string>
        </param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
  </encodingDetectors>
</properties>
{noformat}
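For illustration, the two points above can be sketched in plain Java. This is
only a minimal sketch under stated assumptions, not the actual Tika change: the
class CharsetNameSketch and the helpers stripDirectionSuffix and resolve are
hypothetical names, and the exact-match comparison against the ignore list is
an assumption about how such a list might be applied.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

public class CharsetNameSketch {

    // Hypothetical helper: drop a trailing "_rtl" or "_ltr" marker so the
    // remaining name can be handed to Charset.forName().
    static String stripDirectionSuffix(String detectedName) {
        if (detectedName.endsWith("_rtl") || detectedName.endsWith("_ltr")) {
            return detectedName.substring(0, detectedName.length() - 4);
        }
        return detectedName;
    }

    // Hypothetical helper: return a Charset for the detected name, or empty
    // if the normalized name is on the ignore list or unknown to this JVM.
    // Assumption: the ignore list is matched exactly against the normalized name.
    static Optional<Charset> resolve(String detectedName, Set<String> ignoreCharsets) {
        String normalized = stripDirectionSuffix(detectedName);
        if (ignoreCharsets.contains(normalized)) {
            return Optional.empty();
        }
        try {
            if (!Charset.isSupported(normalized)) {
                return Optional.empty();
            }
            return Optional.of(Charset.forName(normalized));
        } catch (IllegalArgumentException e) {
            // Covers syntactically illegal charset names.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Set<String> ignore = Collections.singleton("IBM424");
        System.out.println(stripDirectionSuffix("IBM424_rtl"));
        System.out.println(resolve("IBM424_rtl", ignore).isPresent());
        System.out.println(resolve("UTF-8", ignore).isPresent());
    }
}
```

With the raw name, Charset.forName("IBM424_rtl") throws the
UnsupportedCharsetException from the original report; stripping the marker
avoids that, and the ignore list lets a caller skip IBM424 altogether.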
> Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector
> --------------------------------------------------------------------------
>
> Key: TIKA-3516
> URL: https://issues.apache.org/jira/browse/TIKA-3516
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Reporter: Chaitra Rajappa
> Priority: Major
>
> Hi,
> The CharsetDetector detects the wrong charset, IBM424_rtl, for a file,
> resulting in this exception:
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket with the same issue that has not been
> fixed.
> https://issues.apache.org/jira/browse/TIKA-2396
> Please suggest the changes to fix this.
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks