[
https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tyler Palsulich closed TIKA-1050.
---------------------------------
Resolution: Cannot Reproduce
Fix Version/s: 1.6
Assignee: Tyler Palsulich
The attached file is detected as GB18030 when run through tika-app
1.6-SNAPSHOT (output below), so I'm closing this issue. Amit, let me know if
you're still having problems.
{code}
➜ java -jar tika-app/target/tika-app-1.6-SNAPSHOT.jar Test\ data-GB.txt
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="403"/>
<meta name="Content-Encoding" content="GB18030"/>
<meta name="Content-Type" content="text/plain; charset=GB18030"/>
<meta name="resourceName" content="Test data-GB.txt"/>
{code}
> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
> Key: TIKA-1050
> URL: https://issues.apache.org/jira/browse/TIKA-1050
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Amit Gupta
> Assignee: Tyler Palsulich
> Priority: Critical
> Fix For: 1.6
>
> Attachments: Test data-GB.txt
>
>
> CharsetDetector reports IBM866 as the charset for a text file that is encoded
> in GB18030. GB18030 receives a lower confidence score than IBM866.
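
To illustrate why this kind of misdetection matters, here is a minimal sketch using only the JDK's java.nio.charset classes. It does not call Tika's CharsetDetector. The class name CharsetMismatchDemo and the helper decode are hypothetical, the sample string stands in for the attached file, and the code assumes the running JRE supports the "GB18030" and "IBM866" charsets (typical full JDKs do, but neither is among the charsets every Java platform is guaranteed to provide).

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    // Hypothetical helper: decode raw bytes with the named charset.
    // new String(...) replaces malformed input instead of throwing.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two CJK characters, written as Unicode escapes so the source
        // file compiles the same way under any source encoding.
        String original = "\u4e2d\u6587";

        // Assumption: the JRE supports GB18030 and IBM866.
        byte[] data = original.getBytes(Charset.forName("GB18030"));

        String asGb18030 = decode(data, "GB18030");
        String asIbm866 = decode(data, "IBM866");

        // The same bytes only recover the original text under GB18030.
        System.out.println("GB18030 round-trips: " + asGb18030.equals(original));
        System.out.println("IBM866 round-trips:  " + asIbm866.equals(original));
    }
}
```

If a detector picks IBM866 for such a file, downstream text extraction decodes the bytes into unrelated characters without raising an error, which is why the reported charset matters even when parsing appears to succeed.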