[ https://issues.apache.org/jira/browse/TIKA-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059658#comment-16059658 ]
Tilman Hausherr commented on TIKA-2256:
---------------------------------------
Tim is correct. IMHO this issue should be closed as "Not a Problem". Microsoft
Word is to blame; [you should try to report the bug to
them|http://www.schveiguy.com/blog/2017/05/how-to-report-a-bug-to-microsoft].
Mention that the ToUnicode stream in the PDF (that is what Tim quoted) is
wrong: it maps the glyph to U+2F47 instead of U+65E5. You can inspect it with
PDFDebugger at
{{Root/Pages/Kids/\[0]/Resources/Font/G1/ToUnicode}}
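
Until the producer is fixed, a reading-side workaround may be enough for some
use cases. The sketch below is only an illustration, not something Tika or
PDFBox does for you: it assumes that folding compatibility characters is
acceptable for your text, and the class name KangxiFold and its method fold are
made up for this example. It relies on U+2F47 (KANGXI RADICAL SUN) having a
compatibility decomposition to U+65E5, so NFKC normalization maps one to the
other.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
class KangxiFold {

    // NFKC replaces compatibility characters (Kangxi radicals among them)
    // with their ordinary equivalents. Assumption: this folding is acceptable
    // for the extracted text. It also rewrites other compatibility forms,
    // such as full-width letters and ligatures.
    static String fold(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from the first line of the PDF:
        // U+2F47 followed by U+672C U+8A9E.
        String extracted = "\u2F47\u672C\u8A9E";
        String folded = fold(extracted);
        // Prints: U+2F47 -> U+65E5
        System.out.printf("U+%04X -> U+%04X%n",
                (int) extracted.charAt(0), (int) folded.charAt(0));
    }
}
```

This only hides the symptom in the extracted string. The ToUnicode stream in the
PDF remains wrong, and NFKC changes more than Kangxi radicals, so check whether
that is acceptable before applying it to every document.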
> Japanese character substituted when reading PDF
> -----------------------------------------------
>
> Key: TIKA-2256
> URL: https://issues.apache.org/jira/browse/TIKA-2256
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14
> Reporter: Christopher Creutzig
> Attachments: mixed-fonts.pdf
>
>
> The attached file contains “日本語” in its first line. It was created on Mac OS
> X 10.11.6 by selecting “Save As PDF” in the system print dialog started from
> Microsoft Word.
> Reading the text from the PDF, the first character is not read as U+65E5, but
> as U+2F47. Copy & paste from Preview.App results in the correct U+65E5 being
> copied. (The characters look the same in some fonts, but are different.)
> The MATLAB code used for reading looks as follows:
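>   % fullname holds the path to the attached PDF
>   handler = org.apache.tika.sax.ToXMLContentHandler;
>   parser = org.apache.tika.parser.AutoDetectParser;
>   metadata = org.apache.tika.metadata.Metadata;
>   fh = java.io.FileInputStream(fullname);
>   parser.parse(fh, handler, metadata);  % detects the type and parses the PDF
>   fh.close();                           % release the file handle
>   s = handler.toString;                 % extracted XHTML as text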