Amir created PDFBOX-2252:
----------------------------

             Summary: PDFTextStripper has problem with bilingual documents
                 Key: PDFBOX-2252
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
            Reporter: Amir
            Priority: Critical


When the input document of PDFTextStripper combines right-to-left and 
left-to-right languages, the output characters of one language are reversed. 
A sample bilingual PDF document is attached.
PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
The class counts the RTL characters and uses that count to decide whether the 
whole content should be reversed. This is incorrect: the decision must be made 
for each word, not for the whole document.
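
To illustrate the per-word decision suggested above, here is a minimal sketch. 
It is not PDFBox code: the class name PerWordDirection and both method names 
are hypothetical, and it assumes the extracted line arrives in visual order 
with words separated by whitespace. It only reverses the characters inside 
words that are mostly RTL; reordering the words within an RTL run is not 
handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: decides direction per word
// instead of once for the whole page.
class PerWordDirection {

    // Matches either a run of non-whitespace (a word) or a run of whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\S+|\\s+");

    // Returns true when the word has more strong RTL than strong LTR characters.
    static boolean isRtlWord(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            byte dir = Character.getDirectionality(codePoint);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(codePoint);
        }
        return rtlCount > ltrCount;
    }

    // Reverses the characters of each RTL-dominant word and leaves every
    // other token (LTR words, digits, whitespace) exactly as it was.
    static String reverseRtlWords(String line) {
        StringBuilder out = new StringBuilder(line.length());
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String token = m.group();
            if (isRtlWord(token)) {
                out.append(new StringBuilder(token).reverse());
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One Latin word followed by a Hebrew word stored in visual order.
        System.out.println(reverseRtlWords("hello \u05DD\u05D5\u05DC\u05E9"));
    }
}
```

With this approach the Latin word passes through untouched while only the 
Hebrew word is reversed, which is the behaviour a single page-wide 
"isRtlDominant" flag cannot give for a bilingual line.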



--
This message was sent by Atlassian JIRA
(v6.2#6252)
