[ 
https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995382#comment-12995382
 ] 

Ken Krugler commented on TIKA-469:
----------------------------------

Hi Robert - do you have an example of an HTML file?

I'm asking because if an HTML document is encoded as UTF-8, the only reasons I 
can think of for the character encoding to be messed up are (a) the HTML meta 
tag uses an encoding name that isn't supported by Java, or (b) there is no 
charset specified in the response header or the HTML meta tags, and the 
algorithmic detection of the character encoding is also failing.
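
To make the two cases concrete, here is a rough sketch of the fallback chain 
I have in mind. The class and method names are hypothetical, this is not 
Tika's actual code, and the header-before-meta ordering is my assumption. It 
uses only java.nio.charset from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration only; not part of Tika.
final class CharsetFallback {

    // Returns the declared charset if this JVM supports it, otherwise null.
    static Charset fromDeclaredName(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null; // case (b): nothing was declared
        }
        String trimmed = name.trim();
        try {
            // case (a): a well-formed name that this JVM does not know
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed) : null;
        } catch (IllegalCharsetNameException e) {
            return null; // case (a): the name is not even syntactically legal
        }
    }

    // Tries the response header, then the meta tag, then the detector's guess.
    // Assumption: the header wins over the meta tag. A null result means every
    // source failed, which is where garbled output would come from.
    static Charset resolve(String headerCharset, String metaCharset, Charset detected) {
        Charset fromHeader = fromDeclaredName(headerCharset);
        if (fromHeader != null) {
            return fromHeader;
        }
        Charset fromMeta = fromDeclaredName(metaCharset);
        if (fromMeta != null) {
            return fromMeta;
        }
        return detected; // may be null if algorithmic detection also failed
    }
}
```

For instance, CharsetFallback.resolve(null, "x-no-such-charset", null) returns 
null: the meta tag names an encoding Java does not support and there is 
nothing else to fall back on.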

Thanks,

-- Ken

> The Parser is not correctly outputting Arabic text documents
> ------------------------------------------------------------
>
>                 Key: TIKA-469
>                 URL: https://issues.apache.org/jira/browse/TIKA-469
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows XP
>            Reporter: Robert Cullen
>         Attachments: TEST_WORD.doc, fever_factsheet_arabic.pdf
>
>
> The parser is not preserving the character encoding when parsing documents in 
> Arabic UTF-8, specifically with .pdf and .doc.  The resulting character 
> output is undecipherable or just question-mark symbols.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
