[ https://issues.apache.org/jira/browse/PDFBOX-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin LeFebvre updated PDFBOX-415:
-----------------------------------

    Attachment: ICU4JImpl.diff

> Errors when decomposing Arabic Ligatures
> ----------------------------------------
>
>                 Key: PDFBOX-415
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-415
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3
>            Reporter: Justin LeFebvre
>         Attachments: allah2.pdf, FC60_Times.pdf, ICU4JImpl.diff
>
>
> For Arabic ligatures U+FC5E to U+FC63, the decomposition of each contains a
> space, which causes a word to be broken up into two words. Also, the U+FDF2
> ligature is handled differently by different fonts. Some encode it as U+0644
> U+0644 U+0647 and add on an extra separate U+0627. U+FDF2 should be encoded
> as U+0627 U+0644 U+0644 U+0647.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
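
To illustrate the behaviour described in the issue, here is a minimal sketch that uses the JDK's own java.text.Normalizer rather than the attached ICU4JImpl.diff, whose contents are not shown here. The class name LigatureDecomposition and the helper methods decompose and toHex are hypothetical and are not part of PDFBox. The sketch assumes the standard Unicode compatibility decomposition (NFKD), under which each of U+FC5E to U+FC63 expands to a sequence that begins with U+0020, and drops that leading space so the surrounding word is not split.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class LigatureDecomposition {

    // Hypothetical helper (not PDFBox API): expands the ligatures named in the
    // issue via NFKD and leaves every other code point untouched.
    static String decompose(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String s = new String(Character.toChars(cp));
            if (cp >= 0xFC5E && cp <= 0xFC63) {
                // The compatibility decomposition starts with U+0020;
                // drop it so the word is not broken in two.
                String d = Normalizer.normalize(s, Normalizer.Form.NFKD);
                out.append(d.startsWith(" ") ? d.substring(1) : d);
            } else if (cp == 0xFDF2) {
                // NFKD yields U+0627 U+0644 U+0644 U+0647 for this ligature.
                out.append(Normalizer.normalize(s, Normalizer.Form.NFKD));
            } else {
                out.append(s);
            }
        }
        return out.toString();
    }

    // Formats a string as a space-separated list of U+XXXX code points.
    static String toHex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String raw = Normalizer.normalize("\uFC60", Normalizer.Form.NFKD);
        System.out.println("NFKD of U+FC60:      " + toHex(raw));
        System.out.println("decompose of U+FC60: " + toHex(decompose("\uFC60")));
        System.out.println("decompose of U+FDF2: " + toHex(decompose("\uFDF2")));
    }
}
```

This only covers ligatures that reach the extractor as a single code point. The second case in the issue, where a font supplies U+0644 U+0644 U+0647 plus a separate U+0627, would need handling at the font-mapping level and is not addressed by this sketch.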