[ https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433282#comment-13433282 ]
Ken Krugler commented on TIKA-868:
----------------------------------

Hi Daniel - using the latest Tika (trunk) I get back UTF-8 as the encoding, if I pass in UTF-8 as the encoding in the content type, via

metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8");

If I don't set the CONTENT_TYPE, I get back ISO-8859-1, which also seems like the right thing.

> TXT parser does not honour the specified encoding
> -------------------------------------------------
>
>                 Key: TIKA-868
>                 URL: https://issues.apache.org/jira/browse/TIKA-868
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>             Fix For: 1.3
>
>
> With input text "Indanyl", the encoding is recognized as IBM500, even when
> "UTF-8" is specified explicitly.
> I would argue that detection should only be used when the declared
> information is incorrect (saving time and avoiding wrong detection), as
> proposed by Ken Krugler in TIKA-539.
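
For illustration only, the "declared encoding first, detection only as a fallback" policy argued for in the issue description could be sketched with nothing but the JDK. This is not Tika's implementation: the class DeclaredCharsetFirst and all of its methods are hypothetical names, and fallbackDetect() is a placeholder that simply returns ISO-8859-1 where a real statistical detector would run.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika.
class DeclaredCharsetFirst {

    /** Returns the charset named in a content type such as
     *  "text/plain; charset=UTF-8", or null if none is declared or known. */
    static Charset declaredCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // malformed or unsupported charset name
                }
            }
        }
        return null;
    }

    /** True if the bytes decode under the charset without any malformed
     *  or unmappable sequences. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Placeholder for real detection (assumption: a fixed default). */
    static Charset fallbackDetect(byte[] data) {
        return StandardCharsets.ISO_8859_1;
    }

    /** Trusts the declared charset when it is usable; detects otherwise. */
    static Charset chooseCharset(byte[] data, String contentType) {
        Charset declared = declaredCharset(contentType);
        if (declared != null && decodesCleanly(data, declared)) {
            return declared;
        }
        return fallbackDetect(data);
    }

    public static void main(String[] args) {
        byte[] text = "Indanyl".getBytes(StandardCharsets.UTF_8);
        System.out.println(chooseCharset(text, "text/plain; charset=UTF-8"));
        System.out.println(chooseCharset(text, null));
    }
}
```

With this ordering, a short input such as "Indanyl" never reaches the detector when a usable charset is declared, which is the case the report says was misdetected as IBM500.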