[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

Amir (JIRA) Wed, 06 Aug 2014 15:09:35 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088351#comment-14088351
 ]


Amir commented on PDFBOX-2259:
------------------------------

OK. 

I tried to inherit PDFTextStripper from PDFMarkedContentExtractor, but the 
problem still exists.

Could you please suggest a solution for this kind of problem?
Please provide some sample code if possible.
Thanks.





> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, the parts of a compound word are separated 
> by a "semi-space" (see the list of Unicode spaces: 
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document 
> contains such words, PDFTextStripper drops the semi-space character and 
> concatenates the words.
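
To make the reported symptom easier to check, here is a minimal sketch in plain Java (no PDFBox calls). It assumes that "semi-space" means U+200C ZERO WIDTH NON-JOINER, which is how the Persian half-space is commonly encoded; the issue itself does not name a code point. The class and method names (SemiSpaceCheck, countSemiSpaces, stripSemiSpaces) are hypothetical and exist only for this illustration. The idea is to compare the expected string with the text returned by the stripper and count how many semi-space characters went missing.

```java
class SemiSpaceCheck {
    // Assumption: the "semi-space" from the report is U+200C (ZERO WIDTH NON-JOINER).
    static final char SEMI_SPACE = '\u200C';

    // Counts how many semi-space characters occur in the given text.
    static int countSemiSpaces(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SEMI_SPACE) {
                count++;
            }
        }
        return count;
    }

    // Simulates the reported behaviour: the semi-space is dropped and the
    // two parts of the compound word are joined.
    static String stripSemiSpaces(String text) {
        return text.replace(String.valueOf(SEMI_SPACE), "");
    }

    public static void main(String[] args) {
        // A Persian compound word written with a semi-space between its parts
        // (U+0645 U+06CC, then U+200C, then U+0631 U+0648 U+0645).
        String expected = "\u0645\u06CC\u200C\u0631\u0648\u0645";

        // Stand-in for the text that the stripper returns for that word.
        String extracted = stripSemiSpaces(expected);

        System.out.println(countSemiSpaces(expected));               // prints 1
        System.out.println(countSemiSpaces(extracted));              // prints 0
        System.out.println(expected.length() - extracted.length());  // prints 1
    }
}
```

In a real check, `extracted` would be the string obtained from the text stripper for the attached document rather than the simulated value used here.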



--
This message was sent by Atlassian JIRA
(v6.2#6252)
