Luís Filipe Nassif created TIKA-3774:
----------------------------------------

             Summary: Fix ignoreCharsets param of Icu4jEncodingDetector
                 Key: TIKA-3774
                 URL: https://issues.apache.org/jira/browse/TIKA-3774
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.4.0
            Reporter: Luís Filipe Nassif
            Assignee: Luís Filipe Nassif
             Fix For: 2.4.1
         Attachments: test_avoid_IBM420_charset.html

That parameter was introduced in TIKA-3516 to exclude undesired charsets in 
advance, but it is not working as expected: detection returns as soon as the 
first ignored charset is found, when it should continue to the next candidate 
charsets. The attached (corrupted) file used to be detected as windows-1252 by 
Tika-1.x, but has been detected as IBM420 since TIKA-3516. The ignoreCharsets 
param should be able to ignore IBM420. I'll push a fix shortly.
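
For illustration only, the return-versus-continue difference described above 
might look like the sketch below. This is not the Icu4jEncodingDetector source: 
the class name, method names, and the use of a plain list of charset names 
ranked by confidence are all assumptions made for the example, and what the 
real code returns on the early exit is not stated in this report (null is 
assumed here).

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from Tika or ICU4J.
class IgnoreCharsetsSketch {

    // Pattern described in the report: the loop exits as soon as the first
    // ignored charset is seen, so later candidates are never considered.
    // Returning null here is an assumption made for this example.
    static String firstAllowedBuggy(List<String> rankedCandidates, Set<String> ignored) {
        for (String name : rankedCandidates) {
            if (ignored.contains(name)) {
                return null; // gives up instead of trying the next candidate
            }
            return name;
        }
        return null;
    }

    // Intended behavior: skip ignored charsets and keep going until a
    // candidate that is not ignored is found.
    static String firstAllowedFixed(List<String> rankedCandidates, Set<String> ignored) {
        for (String name : rankedCandidates) {
            if (ignored.contains(name)) {
                continue; // move on to the next candidate
            }
            return name;
        }
        return null; // every candidate was ignored, or there were none
    }
}
```

With candidates ranked as IBM420 then windows-1252 and IBM420 in the ignored 
set, the first method yields no result while the second yields windows-1252.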



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
