Incorrect ordering of compound Arabic glyphs
--------------------------------------------

                 Key: PDFBOX-684
                 URL: https://issues.apache.org/jira/browse/PDFBOX-684
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0, 1.0.0
            Reporter: Yigal Dayan
            Priority: Minor


Some Arabic PDFs use compound glyphs (ligatures) for stylistic reasons.
Each such glyph encodes two letters, for example FI, SI, LI, LJ, or LM.

By the time a line is passed to the bidirectional (bidi) algorithm, all of its 
characters have been sorted into visual order, except for these pairs. Each 
pair is handled as a single unit, so its two letters keep their original 
(logical) order. The bidi algorithm then straightens out most characters, but 
reverses the letters of each glyph pair.

To fix this, the output of font.encode() should be examined and reversed on the 
spot if it contains a pair of Arabic characters. This may require adding a stub 
method to PDFStreamEngine (in method processEncodedText) that PDFTextStripper 
can override (in sort mode only).
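
A minimal sketch of the proposed check, written as a standalone helper rather 
than against the real PDFBox classes. The class name LigatureOrderFix and its 
methods are hypothetical and are not part of PDFBox; the sketch also assumes 
that a string counts as a glyph pair when it holds two or more code points and 
all of them come from the Unicode Arabic blocks.

```java
// Hypothetical helper, not part of PDFBox: reverses the decoded text of a
// single glyph when it consists solely of two or more Arabic characters.
final class LigatureOrderFix {

    private LigatureOrderFix() {
    }

    // True if the code point belongs to one of the Unicode Arabic blocks
    // considered here (base block plus both presentation-form blocks).
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Returns the input reversed if it is an all-Arabic sequence of at least
    // two code points; otherwise returns the input unchanged.
    static String reverseIfArabicPair(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded;
        }
        if (!decoded.codePoints().allMatch(LigatureOrderFix::isArabic)) {
            return decoded;
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

In this sketch, an overriding method in PDFTextStripper would pass the string 
returned by font.encode() through reverseIfArabicPair() when sorting is 
enabled, and leave it untouched otherwise.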

