[ https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846908#comment-17846908 ]
ASF GitHub Bot commented on TIKA-4255:
--------------------------------------
axeld opened a new pull request, #1761:
URL: https://github.com/apache/tika/pull/1761
If CSVParams.getCharset() is null, the passed-in encoding is used before
auto-detection is attempted.
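
The fallback order described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from the pull request: the class `CharsetFallback`, the method `resolve`, and the `detector` supplier are hypothetical names. The `explicit` argument stands in for the result of CSVParams.getCharset(), and `declaredEncoding` stands in for the value supplied under Metadata.CONTENT_ENCODING.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical helper illustrating the resolution order; not part of Tika.
final class CharsetFallback {

    /**
     * Resolves the charset in priority order:
     *   1. an explicitly configured charset (stand-in for CSVParams.getCharset()),
     *   2. the encoding name passed in by the caller (stand-in for Metadata.CONTENT_ENCODING),
     *   3. whatever the auto-detector guesses.
     */
    static Charset resolve(Charset explicit, String declaredEncoding, Supplier<Charset> detector) {
        if (explicit != null) {
            return explicit;
        }
        if (declaredEncoding != null && !declaredEncoding.trim().isEmpty()) {
            try {
                return Charset.forName(declaredEncoding.trim());
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to detection.
            }
        }
        return detector.get();
    }

    public static void main(String[] args) {
        // Assumption: a detector that guesses wrong on very short input such as "ETL".
        Supplier<Charset> misguidedDetector = () -> StandardCharsets.ISO_8859_1;

        // The declared encoding wins over detection when no explicit charset is set.
        System.out.println(resolve(null, "UTF-8", misguidedDetector));
        // Without a declared encoding, detection is the last resort.
        System.out.println(resolve(null, null, misguidedDetector));
        // An explicit charset takes precedence over everything else.
        System.out.println(resolve(StandardCharsets.US_ASCII, "UTF-8", misguidedDetector));
    }
}
```

The detector is passed as a supplier so that the potentially expensive (and, for very short inputs, unreliable) detection step only runs when neither of the caller-provided hints is usable.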
> TextAndCSVParser ignores Metadata.CONTENT_ENCODING
> --------------------------------------------------
>
> Key: TIKA-4255
> URL: https://issues.apache.org/jira/browse/TIKA-4255
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.6.0, 3.0.0-BETA, 2.9.2
> Reporter: Axel Dörfler
> Priority: Major
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
>
> I pass a text that contains just the string "ETL" to the auto-detect parser.
> I pass the content type and content encoding via Metadata.
> However, TextAndCSVParser ignores the provided encoding (since CSVParams has
> not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and
> detects the encoding itself instead. It ends up detecting IBM424, a Hebrew
> charset, and uses that, which produces rather surprising output.
> Tested with the versions mentioned above, though the bug is probably much
> older.