[
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson closed PDFBOX-2259.
-------------------------------
Resolution: Not a Problem
I took another look at this PDF. The "marked content" does not include any
Unicode mappings for the text, and the text embedded in the PDF does not contain
any Unicode zero-width-space characters. Acrobat, Foxit, and OS X Preview
produce the same result as PDFBox.
This isn't a bug in PDFBox but a problem with how the original PDF file was
created. It's not uncommon to find files which display the correct characters
on the screen but which have invalid Unicode text assigned to those characters.
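
For anyone who wants to confirm this on their own file, here is a minimal
sketch that scans an already-extracted string (for example, the result of
PDFTextStripper.getText) for zero-width code points. The class and method
names are hypothetical, and the choice of U+200B, U+200C and U+200D as the
"semi-space" candidates is an assumption, not something taken from the
attached test.pdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ZeroWidthScan {
    // Assumed candidates for "semi-space": ZERO WIDTH SPACE,
    // ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER.
    private static final int[] ZERO_WIDTH = {0x200B, 0x200C, 0x200D};

    static boolean isZeroWidth(int codePoint) {
        for (int candidate : ZERO_WIDTH) {
            if (candidate == codePoint) {
                return true;
            }
        }
        return false;
    }

    // Returns the UTF-16 offsets at which a zero-width code point occurs.
    static List<Integer> zeroWidthOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isZeroWidth(codePoint)) {
                offsets.add(i);
            }
            i += Character.charCount(codePoint);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument, or use the sample.
        String extracted = args.length > 0 ? args[0] : "ab\u200Ccd";
        System.out.println(zeroWidthOffsets(extracted));
    }
}
```

An empty list for text extracted from a given PDF would be consistent with
the explanation above: the characters are simply not present in the file's
Unicode text, so no extractor can return them.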
> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
> Key: PDFBOX-2259
> URL: https://issues.apache.org/jira/browse/PDFBOX-2259
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6
> Reporter: Amir
> Attachments: test.pdf
>
>
> In some right-to-left languages, compound words are separated using a
> "semi-space" (please take a look at Unicode spaces:
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document
> contains these words, PDFTextStripper drops the semi-space character and
> concatenates the words.