Akash created TIKA-3048:
---------------------------
Summary: Tika unable to parse html files with GB2312 charset
Key: TIKA-3048
URL: https://issues.apache.org/jira/browse/TIKA-3048
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.9
Reporter: Akash
Tika returns junk characters when parsing Chinese text inside an HTML file.
The HTML file explicitly declares its charset as GB2312:
<head><meta http-equiv=Content-Type content="text/html; charset=gb2312"><meta
name=Generator content="Microsoft Word 15 (filtered medium)">
If we remove the charset declaration from the HTML meta tag, parsing works fine.
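
One thing that may be worth checking (an assumption, not a confirmed cause) is whether the bytes of the file really are GB2312, since parsing succeeds once the declared charset is gone. The sketch below uses only the Python standard library, not Tika. The sample text, the helper names declared_charset and strict_candidates, and the candidate list are all hypothetical, chosen for illustration because the original file is not attached to this report.

```python
import re

# Hypothetical stand-in for the reporter's file; the real file is not in the report.
SAMPLE_TEXT = "中文测试"
HTML_TEMPLATE = (
    '<html><head><meta http-equiv=Content-Type '
    'content="text/html; charset=gb2312"></head>'
    "<body>{}</body></html>"
)

# Candidate encodings to try; this list is an assumption, extend it as needed.
CANDIDATES = ("gb2312", "gbk", "gb18030", "utf-8")


def declared_charset(raw: bytes):
    """Return the charset named in the first 1024 bytes of the markup, or None."""
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', raw[:1024], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def strict_candidates(raw: bytes, candidates=CANDIDATES):
    """Return the candidate charsets that decode the bytes without any error."""
    clean = []
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: raises on any invalid byte sequence
        except (UnicodeDecodeError, LookupError):
            continue
        clean.append(name)
    return clean


def report(raw: bytes) -> str:
    """Summarise whether the declared charset is consistent with the bytes."""
    declared = declared_charset(raw)
    clean = strict_candidates(raw)
    verdict = "consistent" if declared in clean else "MISMATCH"
    return f"declared={declared} decodes_cleanly_as={clean} -> {verdict}"


if __name__ == "__main__":
    page = HTML_TEMPLATE.format(SAMPLE_TEXT)
    print(report(page.encode("gb2312")))  # bytes match the declaration
    print(report(page.encode("utf-8")))   # bytes contradict the declaration
```

To try this on the real file, read it in binary mode and pass the bytes to report(). A MISMATCH would suggest that the meta tag, rather than the parser, is the source of the junk characters; a "consistent" result would point back at Tika's handling of the declared charset.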
--
This message was sent by Atlassian Jira
(v8.3.4#803005)