Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image
--------------------------------------------------------------------------------

                 Key: PDFBOX-1216
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1216
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.6.0
            Reporter: Hamed Iravanchi


When a PDF file contains Arabic / Farsi text, the letters appear disconnected 
when pages are converted to images.
In written Arabic / Farsi, letters are normally connected to each other.
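
For readers unfamiliar with the script, here is a small illustration of what 
"connected" and "isolated" forms mean at the Unicode level. It is not PDFBox 
code; the class name ArabicForms and its helper are made up for this sketch. 
It uses the letter BEH (U+0628) and its four entries in the Arabic 
Presentation Forms-B block, and shows that NFKC normalization maps each of 
them back to the same base letter.

```java
import java.text.Normalizer;

public class ArabicForms {
    // Base letter BEH and its four positional presentation forms
    // (Unicode block "Arabic Presentation Forms-B").
    static final char BEH          = '\u0628';
    static final char BEH_ISOLATED = '\uFE8F';
    static final char BEH_FINAL    = '\uFE90';
    static final char BEH_INITIAL  = '\uFE91';
    static final char BEH_MEDIAL   = '\uFE92';

    /** Maps a presentation form back to its base letter using NFKC. */
    static String baseLetter(char form) {
        return Normalizer.normalize(String.valueOf(form), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        char[] forms = {BEH_ISOLATED, BEH_FINAL, BEH_INITIAL, BEH_MEDIAL};
        for (char form : forms) {
            // Print each presentation form next to the base letter it maps to.
            System.out.printf("U+%04X -> U+%04X%n",
                    (int) form, (int) baseLetter(form).charAt(0));
        }
    }
}
```

The isolated form is only correct for a letter standing alone; inside a word 
a renderer is expected to pick the initial, medial or final form.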

Additionally, the message "Changing font on <?> from <B Lotus> to the 
default font" appears on the console.
From debugging the issue, the cause appears to be that PDFBox looks in the 
embedded font for the "isolated" form of the character, whereas the 
embedded font only includes the "connected" form.
If the embedded font contains the isolated form too, the font is displayed 
correctly (the warning message doesn't appear for that character), but the 
character is rendered in the wrong form (i.e. isolated instead of 
connected).
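
The failure mode described above can be modelled in a few lines. This is a 
hypothetical sketch, not PDFBox's actual lookup code: it assumes the embedded 
font can be represented as a plain set of code points, and the class and 
method names are invented. The point is only that a subset font holding the 
connected forms of a letter will miss a request for the isolated form, which 
would then trigger the font substitution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GlyphLookupSketch {
    /**
     * Hypothetical check: true when the embedded font has no glyph for the
     * requested code point, so a renderer would fall back to another font.
     */
    static boolean needsFallback(Set<Integer> embeddedGlyphs, int requestedCodePoint) {
        return !embeddedGlyphs.contains(requestedCodePoint);
    }

    public static void main(String[] args) {
        // Assumed subset font: only the connected (final, initial, medial)
        // forms of the letter BEH are embedded.
        Set<Integer> embedded =
                new HashSet<Integer>(Arrays.asList(0xFE90, 0xFE91, 0xFE92));

        int isolatedBeh = 0xFE8F; // form the renderer asks for, per the report
        int initialBeh  = 0xFE91; // form the document actually uses

        System.out.println(needsFallback(embedded, isolatedBeh)); // true
        System.out.println(needsFallback(embedded, initialBeh));  // false
    }
}
```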

This happens in both the 1.6.0 release and the latest trunk code (as of 
today). I didn't test previous versions.
The difference is that in 1.6.0 the substituted default font (see above) 
contains the Arabic / Persian characters, whereas in trunk the replaced 
characters are displayed as squares.

I will attach a PDF as an input for reproducing the issue.

Note: this might be related to PDFBOX-1127, but that issue concerns text 
extraction.





        
