Luís Filipe Nassif created TIKA-3774:
----------------------------------------
Summary: Fix ignoreCharsets param of Icu4jEncodingDetector
Key: TIKA-3774
URL: https://issues.apache.org/jira/browse/TIKA-3774
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.0
Reporter: Luís Filipe Nassif
Assignee: Luís Filipe Nassif
Fix For: 2.4.1
Attachments: test_avoid_IBM420_charset.html
That parameter was introduced in TIKA-3516 to exclude undesired charsets in
advance, but it is not working as expected: the detector returns as soon as
the first ignored charset is found, when it should continue to the next
candidate charsets. The attached (corrupted) file used to be detected as
windows-1252 by Tika 1.x, but since TIKA-3516 it is detected as IBM420. The
ignoreCharsets param should be able to ignore IBM420. I'll push a fix shortly.
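
To illustrate the control-flow problem described above, here is a minimal
standalone sketch. It is not the Icu4jEncodingDetector source: the class name,
method names, and the use of plain strings for ranked candidates are all
hypothetical, and returning null on the early exit is an assumption made only
to keep the example small. It contrasts returning on the first ignored charset
with skipping it and moving on to the next candidate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and return values are assumptions, not Tika code.
public class IgnoreCharsetsSketch {

    // Buggy shape: leaves the loop as soon as an ignored charset is seen,
    // so lower-ranked candidates are never considered.
    public static String detectBuggy(List<String> rankedCandidates,
                                     Set<String> ignoreCharsets) {
        for (String name : rankedCandidates) {
            if (ignoreCharsets.contains(name)) {
                return null; // assumption: early exit with no result
            }
            return name;
        }
        return null;
    }

    // Intended shape: skip ignored charsets and keep going, returning the
    // first candidate that is not in the ignore set.
    public static String detectFixed(List<String> rankedCandidates,
                                     Set<String> ignoreCharsets) {
        for (String name : rankedCandidates) {
            if (ignoreCharsets.contains(name)) {
                continue; // move on to the next candidate
            }
            return name;
        }
        return null; // every candidate was ignored, or the list was empty
    }

    public static void main(String[] args) {
        // Candidates ordered best match first, as in the attached file's case.
        List<String> ranked = Arrays.asList("IBM420", "windows-1252");
        Set<String> ignored = new HashSet<>(Arrays.asList("IBM420"));

        System.out.println(detectBuggy(ranked, ignored)); // prints "null"
        System.out.println(detectFixed(ranked, ignored)); // prints "windows-1252"
    }
}
```

With IBM420 ranked first and listed in the ignore set, only the second variant
reaches windows-1252, which matches the behaviour expected for the attached
file.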