TOMER MAHLIN created PDFBOX-3096:
------------------------------------

             Summary: Lack of Bidi (Arabic / Hebrew) text reordering in text 
extracted with PDFBox
                 Key: PDFBOX-3096
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3096
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: TOMER MAHLIN


Rendering rules for Bidi (Arabic / Hebrew) text differ between ordinary 
Windows / Android / iOS environments and the Adobe environment. Adobe expects 
text to appear in visual bidi layout, while modern systems work with logical 
bidi layout. 
When text is extracted from a PDF file, it should be converted / normalized to 
logical bidi layout. 
Example:
Assume capital letters stand for Bidi letters.
1. In the Adobe document you see: CBA
2. When you extract the content and display it in Notepad (a web browser, or 
any similar tool), you see ABC, while the expectation is to see CBA. 

For real text containing both Hebrew and English (or Arabic and English) 
characters, the displayed result is completely ruined after text extraction. 
Moreover, even if we ignore display and focus on text manipulation (search, 
comparison, concatenation, etc.), a comparison between the same text authored 
in Notepad and the text extracted from a PDF file will fail. 
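
To illustrate the kind of normalization meant here, below is a minimal sketch 
using the JDK's java.text.Bidi. It is not PDFBox code. The class and method 
names (VisualToLogical, visualToLogical) are made up for this example. It 
assumes a left-to-right base direction and only reverses the characters inside 
each right-to-left run; it does not reorder the runs themselves, so text with 
digits inside Hebrew / Arabic phrases or with a right-to-left paragraph 
direction would need a fuller implementation of the Unicode Bidi algorithm.

```java
import java.text.Bidi;

// Hypothetical helper, for illustration only (not part of PDFBox).
class VisualToLogical {

    // Converts a string stored in visual (left-to-right display) order into an
    // approximation of logical order by reversing every right-to-left run.
    // Assumption: the paragraph's base direction is left-to-right.
    static String visualToLogical(String visual) {
        Bidi bidi = new Bidi(visual, Bidi.DIRECTION_LEFT_TO_RIGHT);
        StringBuilder out = new StringBuilder(visual.length());
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = visual.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            if ((bidi.getRunLevel(i) & 1) == 1) {
                // Odd embedding level means a right-to-left run: reverse it.
                out.append(new StringBuilder(run).reverse());
            } else {
                // Even level means left-to-right: keep as is.
                out.append(run);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Visual order as it might come out of the PDF: gimel, bet, alef.
        String visual = "abc \u05D2\u05D1\u05D0 def";
        // Logical order as typed in Notepad: alef, bet, gimel.
        String typed = "abc \u05D0\u05D1\u05D2 def";
        System.out.println(visual.equals(typed));                  // comparison before normalization
        System.out.println(visualToLogical(visual).equals(typed)); // comparison after normalization
    }
}
```

With the strings in main, the raw extracted text does not equal the typed text, 
while the normalized text does, which is the search / comparison failure 
described above.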




