Christian Kohlschütter created PDFBOX-1652:
----------------------------------------------
Summary: TextPosition: Japanese alphabetic characters 30fc and
3005 treated as diacritics
Key: PDFBOX-1652
URL: https://issues.apache.org/jira/browse/PDFBOX-1652
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.1
Reporter: Christian Kohlschütter
Attachments: PDFBOX-1652.patch
For the purpose of determining the position in text, the Japanese characters
U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) and U+3005 (IDEOGRAPHIC
ITERATION MARK) are currently treated as "simple" diacritics. In fact, they
are full-fledged characters in terms of text positioning.
As a result, text extraction can emit some characters in reversed order
(in particular, ーン can come out as ンー).
A patch to fix this is attached.
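
To illustrate the problem, here is a minimal sketch. It is not the attached patch. It assumes the diacritic test is based on the Unicode general category; the report does not say so. Both U+30FC and U+3005 have the category "modifier letter" (Lm), so a category-only check flags them. The class name DiacriticCheck and both method names are hypothetical and not part of PDFBox.

```java
final class DiacriticCheck {

    // Hypothetical category-only test: flags non-spacing marks,
    // modifier symbols and modifier letters as diacritics.
    static boolean naiveIsDiacritic(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    // Hypothetical adjusted test: the two Japanese characters named in this
    // report are excluded explicitly, because they occupy their own position
    // in the text even though their general category is "modifier letter".
    static boolean isDiacritic(char c) {
        if (c == '\u30FC' || c == '\u3005') {
            return false;
        }
        return naiveIsDiacritic(c);
    }

    public static void main(String[] args) {
        // Prolonged sound mark, iteration mark, combining acute accent, plain letter.
        char[] samples = { '\u30FC', '\u3005', '\u0301', 'a' };
        for (char c : samples) {
            System.out.printf("U+%04X naive=%b adjusted=%b%n",
                    (int) c, naiveIsDiacritic(c), isDiacritic(c));
        }
    }
}
```

With a check like the naive one, the prolonged sound mark in ーン would be handled as a mark attached to a neighbouring character instead of as a character with its own position, which is consistent with the reversed output described above.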