[jira] Issue Comment Edited: (PDFBOX-684) Incorrect ordering of compound Arabic glyphs

Yigal Dayan (JIRA) Thu, 08 Apr 2010 01:18:12 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854862#action_12854862
 ]


Yigal Dayan edited comment on PDFBOX-684 at 4/8/10 8:15 AM:
------------------------------------------------------------

Attaching sample pdf and two utf8 outputs (before and after fix)

      was (Author: ydayan):
    Attaching sample pdf and two utf8 outputs (beore and after fix)
  
> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
>                 Key: PDFBOX-684
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-684
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Yigal Dayan
>            Priority: Minor
>         Attachments: zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Some Arabic PDFs contain compound glyphs for stylistic reasons.
> Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.
> Before a line gets sent to the bidirectional algorithm, all characters have
> been sorted into visual order, except for the letters inside these pairs:
> each pair is handled as one unit and keeps its original (logical) order. The
> bidi algorithm straightens out most characters, but reverses the glyph pairs.
> To fix this, the output of font.encode() should be examined and reversed on
> the spot if it contains a pair of Arabic characters. This may require adding
> a stub method to PDFStreamEngine (in method processEncodedText) that
> PDFTextStripper can override (in sort mode only).
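
The reversal proposed in the description can be sketched roughly as below. This is only an illustration under stated assumptions, not the attached fix and not PDFBox API: the class CompoundGlyphFix and the method reverseIfArabicCompound are hypothetical names, the input is assumed to be the string decoded for a single glyph, and "Arabic" is assumed to mean the Unicode Arabic block plus the two Arabic Presentation Forms blocks.

```java
// Hypothetical helper, not part of PDFBox: reverses the decoded text of one
// glyph when that text consists of two or more Arabic characters.
final class CompoundGlyphFix {

    private CompoundGlyphFix() {
    }

    // Assumption: a character counts as Arabic if it lies in the Arabic block
    // or in one of the two Arabic Presentation Forms blocks.
    static boolean isArabic(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Returns the input unchanged unless it has at least two characters and
    // every character is Arabic; in that case returns the characters reversed.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.length() < 2) {
            return decoded;
        }
        for (int i = 0; i < decoded.length(); i++) {
            if (!isArabic(decoded.charAt(i))) {
                return decoded;
            }
        }
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

An override in the text stripper, if one were added as suggested, could pass each decoded glyph string through such a helper in sort mode only, so that the later bidi pass puts the pair back into logical order.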

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
