Akash created TIKA-3048:
---------------------------

             Summary: Tika unable to parse html files with GB2312 charset
                 Key: TIKA-3048
                 URL: https://issues.apache.org/jira/browse/TIKA-3048
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.9
            Reporter: Akash


Tika returns junk characters when parsing Chinese characters in an HTML 
file. The HTML file explicitly declares its charset as GB2312:

<head><meta http-equiv=Content-Type content="text/html; charset=gb2312"><meta 
name=Generator content="Microsoft Word 15 (filtered medium)">

 

If we remove the charset declaration from the HTML meta tag, parsing works fine.
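
One possible explanation, not confirmed for this file, is that the bytes are 
actually stored in a different encoding than the one the meta tag declares. 
The following is a minimal sketch using only the JDK (no Tika calls) that 
shows how such a mismatch produces junk. The class name CharsetMismatchDemo, 
the helper decode, and the sample text are hypothetical, and it assumes the 
JDK in use ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode raw bytes with the charset a meta tag names.
    static String decode(byte[] raw, String declaredCharset) {
        return new String(raw, Charset.forName(declaredCharset));
    }

    public static void main(String[] args) {
        // Sample text (two Chinese characters), written as Unicode escapes
        // so the source file's own encoding does not matter.
        String original = "\u4e2d\u6587";

        // Assumption: the file is really UTF-8 although it declares gb2312.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Trusting the declared charset does not recover the original text.
        String asDeclared = decode(utf8Bytes, "GB2312");
        System.out.println("declared matches: " + asDeclared.equals(original));

        // Decoding with the real encoding does recover it.
        String asActual = decode(utf8Bytes, "UTF-8");
        System.out.println("actual matches:   " + asActual.equals(original));
    }
}
```

If the file in question behaves the same way, dumping its first bytes and 
decoding them with a few candidate charsets would show which one is really 
in use.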



--
This message was sent by Atlassian Jira
(v8.3.4#803005)