[
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088254#comment-14088254
]
Amir edited comment on PDFBOX-2259 at 8/6/14 9:08 PM:
------------------------------------------------------
I'm not sure which Unicode character corresponds to the semi-space. I think it's
"ZERO WIDTH SPACE". For example, the attached document contains
"نیمفاصلهها", a Persian word composed of "نیم" + "فاصله" + "ها",
joined by semi-spaces ("ZERO WIDTH SPACE"). The output of
PDFTextStripper is "نیمفاصلهها", with the semi-spaces dropped, which is incorrect.
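
One way to settle which character is involved is to list the zero-width code points in a string before and after extraction. The sketch below is only an illustration and does not call PDFBox: the class and method names are hypothetical, and the sample string assumes the semi-space is U+200C (ZERO WIDTH NON-JOINER), which is the usual choice in Persian text but has not been confirmed for test.pdf. U+200B (ZERO WIDTH SPACE) is checked as well.

```java
import java.util.ArrayList;
import java.util.List;

public class SemiSpaceCheck {
    // Candidate "semi-space" code points. Which one test.pdf really uses is an assumption to verify.
    static final int ZWSP = 0x200B; // ZERO WIDTH SPACE
    static final int ZWNJ = 0x200C; // ZERO WIDTH NON-JOINER

    /** Returns the Unicode names of the zero-width candidates found in the text, in order. */
    static List<String> findZeroWidth(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (cp == ZWSP || cp == ZWNJ) {
                found.add(Character.getName(cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // The three parts of the example word, joined here by U+200C (assumed).
        String expected = "\u0646\u06CC\u0645" + "\u200C"
                + "\u0641\u0627\u0635\u0644\u0647" + "\u200C"
                + "\u0647\u0627";
        // Simulates extraction output in which the semi-spaces were dropped.
        String extracted = expected.replace("\u200C", "");

        System.out.println("expected:  " + findZeroWidth(expected));
        System.out.println("extracted: " + findZeroWidth(extracted));
    }
}
```

Running the same check on the string returned by PDFTextStripper, instead of the simulated one, would show whether any zero-width character survives extraction.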
was (Author: amirjadidi):
I'm not sure what is the equivalent character for semi-space. I think it's a
"ZERO WIDTH SPACE". For example, check the attached document, it contains
"نیمفاصلهها", this word is in Persian and compounds of "نیم"+"فاصله"+"ها"
which have been concatenated via semi-space ("ZERO WIDTH SPACE"). The output of
PDFTextStripper is "نیمفاصلهها". It's incorrect.
> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
> Key: PDFBOX-2259
> URL: https://issues.apache.org/jira/browse/PDFBOX-2259
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6
> Reporter: Amir
> Attachments: test.pdf
>
>
> In some right-to-left languages, the parts of a compound word are separated by a
> "semi-space" (see the list of Unicode spaces at
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document
> contains such words, PDFTextStripper drops the semi-space character and
> runs the parts together.
--
This message was sent by Atlassian JIRA
(v6.2#6252)