[ https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847980#comment-17847980 ]
Tim Allison commented on TIKA-4255:
-----------------------------------

Thank you for opening this PR. Are you able to add a small unit test to confirm the behavior? I can't tell from the above whether you're setting {{CONTENT_TYPE_USER_OVERRIDE}} or whether you're setting CONTENT_TYPE and ENCODING. It looks like the code is trying to pull the encoding from {{CONTENT_TYPE_USER_OVERRIDE}}.

> TextAndCSVParser ignores Metadata.CONTENT_ENCODING
> --------------------------------------------------
>
>                 Key: TIKA-4255
>                 URL: https://issues.apache.org/jira/browse/TIKA-4255
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 2.6.0, 3.0.0-BETA, 2.9.2
>            Reporter: Axel Dörfler
>            Priority: Major
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I pass a text that contains just "ETL" to the auto-detect parser. I
> pass on content type and content encoding information via Metadata.
> However, TextAndCSVParser ignores the provided encoding (since CSVParams has
> not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and
> chooses to detect it by itself instead. It turns out that it detects an
> IBM424 Hebrew charset and uses that, which results in rather surprising output.
> Tested with the mentioned versions, though the bug is probably much older.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
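
To illustrate why the misdetected charset in the report produces surprising output, here is a minimal sketch that uses only the JDK, not Tika's API. It decodes the bytes of "ETL" once with the declared encoding and once with IBM424. The class name {{EncodingMismatchDemo}} and the choice of UTF-8 as the declared encoding are assumptions made for illustration; the issue does not say which encoding was passed in the Metadata.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not reproduce TextAndCSVParser itself.
final class EncodingMismatchDemo {

    // Decode raw bytes with the named charset.
    // Returns null when this JVM does not ship the charset
    // (IBM424 is part of the optional extended charsets).
    static String decode(byte[] raw, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumption: the caller declared UTF-8 for the text "ETL".
        byte[] raw = "ETL".getBytes(StandardCharsets.UTF_8);

        // What the caller expects when the declared encoding is honored.
        System.out.println("declared UTF-8 : " + decode(raw, "UTF-8"));

        // What comes out when the same bytes are read as IBM424 instead.
        System.out.println("detected IBM424: " + decode(raw, "IBM424"));
    }
}
```

The same three bytes map to different characters under the EBCDIC-based IBM424 table, which is consistent with the garbled result described in the report. A unit test for the fix would presumably assert that the parser output matches the first decoding when the encoding is supplied via Metadata.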