Christopher Creutzig created PDFBOX-3833:
--------------------------------------------
Summary: Characters in wrong order
Key: PDFBOX-3833
URL: https://issues.apache.org/jira/browse/PDFBOX-3833
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.5
Reporter: Christopher Creutzig
Attachments: ML_mathworks_unc2.pdf
The attached PDF file (page 3 of
https://jp.mathworks.com/tagteam/89688_93050v00_JP_machine_learning_section1_ebook.pdf)
shows multiple problems when read with PDFBox using default settings. This
bug report is specifically about the Katakana character ー being misplaced.
In the text block on the left, the second line starts with ターン.
PDFTextStripper.getText returns text starting with タ ンー (i.e., it inserts a space
after the first character and swaps the second and third characters). The same
problem also occurs elsewhere in the complete file.
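
To make the difference between the expected and the returned text explicit, the following sketch prints the Unicode code points of both strings. It uses only the Java standard library and does not call PDFBox. The class name CodePointDump is made up for illustration, and the inserted whitespace is assumed to be a plain U+0020 space, which the report does not state.

```java
import java.util.StringJoiner;

final class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String describe(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Text as it appears on the page: ターン
        String expected = "\u30BF\u30FC\u30F3";
        // Text as quoted in the report: タ ンー
        // (assumption: the inserted character is an ordinary space)
        String reported = "\u30BF \u30F3\u30FC";

        System.out.println("expected: " + describe(expected));
        System.out.println("reported: " + describe(reported));
    }
}
```

Under that assumption, the expected sequence is U+30BF U+30FC U+30F3 and the reported one is U+30BF U+0020 U+30F3 U+30FC, so the prolonged sound mark U+30FC moves from second to last position.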
At this point, the PDF itself contains [<03BB>43.9 <0294>156 <03EF>-24.5 ...]TJ,
which lists the characters in the correct order. Copy and paste using Apple's
Preview.app also preserves that order.
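
As a rough illustration of the content-stream order, the sketch below extracts the hex strings from the quoted part of the TJ array and skips the numeric position adjustments between them. It is a standalone helper written for this report, not PDFBox code. The class name TjArrayOrder is hypothetical, only the three elements quoted above are used (the elided remainder is left out), and no claim is made about which Unicode characters these codes map to, since that depends on the font's encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TjArrayOrder {
    // Matches either a hex string such as <03BB> or a number such as -24.5.
    private static final Pattern TOKEN =
            Pattern.compile("<([0-9A-Fa-f]+)>|(-?\\d+(?:\\.\\d+)?)");

    // Returns the hex strings of a TJ array in the order they appear,
    // ignoring the numeric position adjustments between them.
    static List<String> glyphCodes(String tjArray) {
        List<String> codes = new ArrayList<>();
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                codes.add(m.group(1).toUpperCase());
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Only the elements quoted in the report; the rest is elided there.
        String quoted = "[<03BB>43.9 <0294>156 <03EF>-24.5]";
        System.out.println(glyphCodes(quoted)); // prints [03BB, 0294, 03EF]
    }
}
```

The codes come out in the same order as they are written in the array, so the swap reported above would have to be introduced after the content stream is read, for example while the extracted glyphs are ordered by position.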