Amir created PDFBOX-2259:
----------------------------

             Summary: PDFTextStripper has problem with semi-space characters
                 Key: PDFBOX-2259
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
            Reporter: Amir
            Priority: Critical
         Attachments: test.pdf

In some right-to-left languages, the parts of a compound word are separated
by a "semi-space" (see the list of Unicode spaces at
https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document
contains such words, PDFTextStripper drops the semi-space character and
concatenates the words.
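
To make the symptom concrete, here is a minimal sketch of a check that could
be run on extracted text. It rests on an assumption the report does not
state: that "semi-space" means U+200C ZERO WIDTH NON-JOINER, as used in
Persian. The class and method names (SemiSpaceCheck, countSemiSpaces,
preservesSemiSpaces) are hypothetical, and the "extracted" string is
simulated by deleting the character rather than produced by PDFBox.

```java
public class SemiSpaceCheck {
    // Assumption: the "semi-space" in the report is U+200C (ZERO WIDTH NON-JOINER).
    static final char ZWNJ = '\u200C';

    /** Counts the ZWNJ characters in the given text. */
    static int countSemiSpaces(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ZWNJ) {
                count++;
            }
        }
        return count;
    }

    /** True when the extracted text contains as many semi-spaces as the expected text. */
    static boolean preservesSemiSpaces(String expected, String extracted) {
        return countSemiSpaces(expected) == countSemiSpaces(extracted);
    }

    public static void main(String[] args) {
        // A Persian compound word whose two parts are joined by a ZWNJ.
        String expected = "\u0645\u06CC" + ZWNJ + "\u062E\u0648\u0627\u0647\u0645";

        // Simulates the reported behaviour: the semi-space is lost and the parts run together.
        String concatenated = expected.replace(String.valueOf(ZWNJ), "");

        System.out.println(preservesSemiSpaces(expected, expected));     // prints "true"
        System.out.println(preservesSemiSpaces(expected, concatenated)); // prints "false"
    }
}
```

In a real test, the second argument would be the string returned by
PDFTextStripper for the attached test.pdf, and the first would be the text
as it appears in the document.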



--
This message was sent by Atlassian JIRA
(v6.2#6252)
