Axel Dörfler created TIKA-4255:
----------------------------------
Summary: TextAndCSVParser ignores Metadata.CONTENT_ENCODING
Key: TIKA-4255
URL: https://issues.apache.org/jira/browse/TIKA-4255
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.9.2, 3.0.0-BETA, 2.6.0
Reporter: Axel Dörfler
I pass a text to the auto-detect parser that contains only the string "ETL". I
also pass the content type and content encoding via Metadata.
However, TextAndCSVParser ignores the provided encoding (since CSVParams has
not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and
instead detects the encoding itself. It turns out that it detects an IBM424
Hebrew charset and uses that, which produces rather surprising output.
Tested with the versions listed above, though the bug is probably much older.
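
To illustrate why the output looks surprising, here is a minimal sketch that
uses only the JDK (no Tika involved). It decodes the bytes of "ETL" once with
the encoding the caller intended and once with IBM424. The class name, the
decode helper, and the choice of UTF-8 as the intended encoding are assumptions
made for illustration; the report does not say which encoding was passed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode raw bytes with the named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The three ASCII bytes of the input text from the report.
        byte[] input = "ETL".getBytes(StandardCharsets.US_ASCII);

        // Assumed intended encoding (not stated in the report).
        System.out.println("UTF-8 : " + decode(input, "UTF-8"));

        // IBM424 is an EBCDIC code page, so the same bytes map to other characters.
        // It ships with the JDK's extended charsets and may be missing in minimal runtimes.
        if (Charset.isSupported("IBM424")) {
            System.out.println("IBM424: " + decode(input, "IBM424"));
        } else {
            System.out.println("IBM424 is not available in this runtime");
        }
    }
}
```

With an input this short, a statistical charset detector has very little to go
on, which is why honoring the caller-supplied encoding matters here.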