[jira] [Commented] (TIKA-4255) TextAndCSVParser ignores Metadata.CONTENT_ENCODING

Tim Allison (Jira) Mon, 20 May 2024 13:29:38 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847980#comment-17847980 ]


Tim Allison commented on TIKA-4255:
-----------------------------------

Thank you for opening this PR. Are you able to add a small unit test to confirm 
behavior? 

I can't tell from the above whether you're setting {{CONTENT_TYPE_USER_OVERRIDE}} 
or {{CONTENT_TYPE}} and {{CONTENT_ENCODING}}.

It looks like the code is trying to pull the encoding from 
{{CONTENT_TYPE_USER_OVERRIDE}}.

> TextAndCSVParser ignores Metadata.CONTENT_ENCODING
> --------------------------------------------------
>
>                 Key: TIKA-4255
>                 URL: https://issues.apache.org/jira/browse/TIKA-4255
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.6.0, 3.0.0-BETA, 2.9.2
>            Reporter: Axel Dörfler
>            Priority: Major
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I pass a text to the auto-detect parser that contains only the text "ETL". I 
> pass the content type and content encoding via Metadata.
> However, TextAndCSVParser ignores the provided encoding (since CSVParams has 
> not been provided via TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE) and 
> instead detects the encoding itself. It detects an IBM424 Hebrew charset and 
> uses it, which produces rather surprising output.
> Tested with the versions listed above, though the bug is probably much older.
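
For illustration only, the behavior the reporter seems to expect (a declared encoding wins, and detection is only a fallback) could be sketched as below. This is not Tika code: {{DeclaredEncodingSketch}}, {{resolveCharset}}, the plain {{Map}} standing in for Metadata, and the detector stub are all hypothetical names for this sketch, and the "Content-Encoding" key is an assumed stand-in for Metadata.CONTENT_ENCODING.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch, not Tika code: prefer a caller-declared encoding over detection.
final class DeclaredEncodingSketch {

    // Assumed stand-in for the metadata key that carries the declared encoding.
    static final String CONTENT_ENCODING = "Content-Encoding";

    // Returns the declared charset if it is present and supported by this JVM,
    // otherwise whatever the supplied detector reports.
    static Charset resolveCharset(Map<String, String> metadata, Supplier<Charset> detector) {
        String declared = metadata.get(CONTENT_ENCODING);
        if (declared != null) {
            try {
                String name = declared.trim();
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Illegal or empty charset name: ignore it and fall back to detection.
            }
        }
        return detector.get();
    }

    public static void main(String[] args) {
        // Stub detector; a real one would inspect the bytes and may guess wrong on short input.
        Supplier<Charset> detector = () -> StandardCharsets.ISO_8859_1;

        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_ENCODING, "UTF-8");

        System.out.println(resolveCharset(metadata, detector));        // declared encoding is used
        System.out.println(resolveCharset(new HashMap<>(), detector)); // nothing declared: detector is used
    }
}
```

Falling back silently on an unusable name is just one design choice here; a parser could equally log or reject a declared encoding it cannot honor.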



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
