Amir created PDFBOX-2259:
----------------------------
Summary: PDFTextStripper has problem with semi-space characters
Key: PDFBOX-2259
URL: https://issues.apache.org/jira/browse/PDFBOX-2259
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Reporter: Amir
Priority: Critical
Attachments: test.pdf
In some right-to-left languages, the parts of a compound word are separated by
a "semi-space" character (see the overview of Unicode spaces at
https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document
contains such words, PDFTextStripper drops the semi-space character and
concatenates the words.
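
The symptom can be checked on any extracted string, for example the result of PDFTextStripper.getText(). The following is only a minimal sketch. It assumes the "semi-space" is the zero-width non-joiner U+200C; the report does not name the code point. The class SemiSpaceCheck and its methods are hypothetical helpers, not part of PDFBox, and the sample word is an illustrative Persian compound rather than text taken from test.pdf.

```java
// Hypothetical helper, not part of PDFBox. It inspects extracted text for the
// zero-width non-joiner (U+200C), which is ASSUMED here to be the "semi-space"
// mentioned in the report.
final class SemiSpaceCheck {

    // Assumed code point of the semi-space character.
    static final char SEMI_SPACE = '\u200C';

    // Counts how many semi-space characters the text contains.
    static int countSemiSpaces(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SEMI_SPACE) {
                count++;
            }
        }
        return count;
    }

    // Removes every semi-space, simulating the concatenation described above.
    static String stripSemiSpaces(String text) {
        return text.replace(String.valueOf(SEMI_SPACE), "");
    }

    public static void main(String[] args) {
        // Illustrative Persian compound word whose two parts are joined by U+200C.
        String expected = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        // Stand-in for extractor output that has lost the semi-space.
        String extracted = stripSemiSpaces(expected);

        System.out.println(countSemiSpaces(expected));   // prints 1
        System.out.println(countSemiSpaces(extracted));  // prints 0
        System.out.println(expected.equals(extracted));  // prints false
    }
}
```

If the count is zero for text extracted from a document known to contain such words, the separator was lost during extraction rather than being absent from the source.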
--
This message was sent by Atlassian JIRA
(v6.2#6252)