[jira] [Commented] (TIKA-3516) Unexpected charset IBM424_rtl detected for utf_8 file by CharsetDetector

Tim Allison (Jira) Tue, 17 Aug 2021 07:19:11 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400426#comment-17400426
 ]


Tim Allison commented on TIKA-3516:
-----------------------------------

I took some time to dig into this.  There are two problems.

1) The Icu4jEncodingDetector's CharsetMatch already has code to strip off the 
*_rtl and *_ltr suffixes in its #getString().  In other words, the developers 
of the Icu4jEncodingDetector knew that you couldn't pass 
CharsetMatch.getName() directly to Charset.forName()...the caller had to strip 
the _rtl or _ltr from some charset names first.  I added a 
"getNormalizedName()" that takes on the burden of stripping those markers.  
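
To illustrate the kind of stripping described above, here is a minimal JDK-only sketch. It is not the code that went into Tika; the class name CharsetNameSketch and the helper stripDirection are hypothetical, and the only assumption is that the direction markers appear as a trailing "_rtl" or "_ltr".

```java
import java.nio.charset.Charset;

public class CharsetNameSketch {

    // Hypothetical helper (not Tika's getNormalizedName()): drops a trailing
    // "_rtl" or "_ltr" direction marker so the name can be given to
    // Charset.forName().
    public static String stripDirection(String detectedName) {
        if (detectedName.endsWith("_rtl") || detectedName.endsWith("_ltr")) {
            return detectedName.substring(0, detectedName.length() - 4);
        }
        return detectedName;
    }

    public static void main(String[] args) {
        String detected = "IBM424_rtl";
        String normalized = stripDirection(detected);
        // Whether the normalized name is supported depends on the charsets
        // shipped with the running JDK, so check before calling forName().
        System.out.println(detected + " supported: " + Charset.isSupported(detected));
        System.out.println(normalized + " supported: " + Charset.isSupported(normalized));
    }
}
```

Checking Charset.isSupported() first avoids the UnsupportedCharsetException from the original report when a name is still not known to the JVM.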

So, that's all to the good, however...
2) Now, on the test file from TIKA-2396, the charset name is normalized to 
IBM424, and decoding that test file as IBM424 yields complete junk.  So, I 
added the ability to ignore specific charsets.  I experimented with using the 
confidence value, but it wasn't very informative.  Unless there's a better 
alternative, users for now can turn off detection of IBM424 with:
{noformat}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
    </parsers>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HTMLEncodingDetector"/>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="ignoreCharsets" type="list">
                    <string>IBM424</string>
                </param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
    </encodingDetectors>
</properties>
{noformat}
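
For readers who want to see what ignoring a charset amounts to, the following is a rough JDK-only sketch of the idea, not the actual Icu4jEncodingDetector change. The class IgnoreCharsetsSketch and the method firstAllowed are hypothetical, and the sketch assumes the candidate names are already normalized (no _rtl/_ltr suffix) and ordered best match first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IgnoreCharsetsSketch {

    // Hypothetical helper: returns the first candidate charset name that is
    // not on the ignore list, or null if every candidate is ignored.
    public static String firstAllowed(List<String> rankedNames, Set<String> ignored) {
        for (String name : rankedNames) {
            if (!ignored.contains(name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative candidate list only; not real detector output.
        List<String> ranked = Arrays.asList("IBM424", "UTF-8");
        Set<String> ignored = new HashSet<>(Arrays.asList("IBM424"));
        System.out.println(firstAllowed(ranked, ignored)); // prints "UTF-8"
    }
}
```

Returning null when every candidate is ignored leaves the decision to whichever detector comes next in the chain, which is one possible design and not necessarily what Tika does.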

> Unexpected charset IBM424_rtl detected for  utf_8  file by CharsetDetector
> --------------------------------------------------------------------------
>
>                 Key: TIKA-3516
>                 URL: https://issues.apache.org/jira/browse/TIKA-3516
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>            Reporter: Chaitra Rajappa
>            Priority: Major
>
> Hi,
>  The CharsetDetector detects the wrong charset, IBM424_rtl, for a file, 
>  resulting in the exception 
> *_java.nio.charset.UnsupportedCharsetException: IBM424_rtl 17 at 
> java.nio.charset.Charset.forName(Charset.java:531)_*
> I see there is also an existing ticket for the same issue that has not been 
> fixed.
> https://issues.apache.org/jira/browse/TIKA-2396
>  Please suggest the changes to fix this. 
> Versions being used:
> apache-core - 1.20
> apache-parsers-1.20
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
