[
https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854862#action_12854862
]
Yigal Dayan edited comment on PDFBOX-684 at 4/8/10 8:15 AM:
------------------------------------------------------------
Attaching sample pdf and two utf8 outputs (before and after fix)
was (Author: ydayan):
Attaching sample pdf and two utf8 outputs (beore and after fix)
> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
> Key: PDFBOX-684
> URL: https://issues.apache.org/jira/browse/PDFBOX-684
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.0.0, 1.1.0
> Reporter: Yigal Dayan
> Priority: Minor
> Attachments: zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Some Arabic PDFs contain compound glyphs for stylistic reasons.
> Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.
> Before a line is sent to the bidirectional (bidi) algorithm, all characters
> have been sorted into visual order, except for these pairs. Each pair is
> handled as a single unit, so its two letters keep their original (logical)
> order. The bidi algorithm then restores logical order for most characters,
> but reverses the glyph pairs, which were already in logical order.
> To fix this, the output of font.encode() should be examined and reversed on
> the spot if it contains a pair of Arabic characters. This may require adding
> a stub method to PDFStreamEngine (called from processEncodedText) that
> PDFTextStripper can override (in sort mode only).
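
The reversal step described above can be sketched in isolation. This is only an illustration, not the attached fix: the class and method names (ArabicPairReverser, isArabic, reverseIfArabicCompound) are hypothetical, the check is assumed to operate on the string that a compound glyph decodes to, and the hook into PDFStreamEngine / PDFTextStripper is not shown.

```java
// Hypothetical helper, not part of PDFBox. It shows only the "reverse on the
// spot" check for a decoded glyph string; wiring it into text extraction is
// left out.
public final class ArabicPairReverser {

    private ArabicPairReverser() {
    }

    /** Returns true if the code point lies in one of the Arabic Unicode blocks. */
    public static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /**
     * Reverses the decoded text of a single glyph when it consists of two or
     * more Arabic characters; any other input is returned unchanged.
     */
    public static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // null, empty, or a single character: nothing to swap
        }
        if (!decoded.codePoints().allMatch(ArabicPairReverser::isArabic)) {
            return decoded; // mixed or non-Arabic content (e.g. a Latin "fi" ligature)
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

With this sketch, a glyph that decodes to LAM followed by MEEM ("\u0644\u0645") would come back as MEEM followed by LAM, so that the later bidi pass returns the pair to logical order, while single characters and non-Arabic ligatures pass through untouched.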