Christian Kohlschütter created PDFBOX-1652:
----------------------------------------------

             Summary: TextPosition: Japanese alphabetic characters 30fc and 3005 treated as diacritics
                 Key: PDFBOX-1652
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1652
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.1
            Reporter: Christian Kohlschütter
         Attachments: PDFBOX-1652.patch

For the purpose of determining the position in text, the Japanese characters 
U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) and U+3005 (IDEOGRAPHIC 
ITERATION MARK) are currently regarded as "simple" diacritics. Apparently, 
they are fully-fledged characters in terms of text positioning.

This can have the effect that, when extracting text, some characters actually 
get reversed (in particular, ーン can come out as ンー).
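
To illustrate how such a misclassification could arise, here is a minimal 
sketch. It assumes the diacritic test is based on the Unicode general 
category; that is an assumption for illustration, not the actual PDFBox code 
or the attached patch. The class DiacriticSketch and both helper methods are 
hypothetical names. Both characters have the general category Lm (modifier 
letter), so any check that counts modifier letters as diacritics will catch 
them.

```java
// Sketch only: a category-based diacritic test. This is NOT the PDFBox
// implementation and NOT the attached patch; all names are hypothetical.
class DiacriticSketch {

    // Hypothetical naive check: counts modifier letters (Lm) as diacritics,
    // alongside non-spacing marks (Mn) and modifier symbols (Sk).
    static boolean isDiacriticNaive(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    // Hypothetical adjusted check: U+30FC and U+3005 occupy their own
    // position in the text, so they are excluded before the category test.
    static boolean isDiacriticAdjusted(char c) {
        if (c == '\u30FC' || c == '\u3005') {
            return false;
        }
        return isDiacriticNaive(c);
    }

    public static void main(String[] args) {
        // U+0301 (COMBINING ACUTE ACCENT) is included as a real diacritic
        // for comparison.
        char[] samples = {'\u30FC', '\u3005', '\u0301'};
        for (char c : samples) {
            System.out.printf("U+%04X naive=%b adjusted=%b%n",
                    (int) c, isDiacriticNaive(c), isDiacriticAdjusted(c));
        }
    }
}
```

Under this assumption, the naive check flags both Japanese characters, while 
the adjusted one still recognises a genuine combining mark such as U+0301.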

A patch to fix this is attached.


