[jira] [Closed] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

John Hewson (JIRA) Thu, 25 Sep 2014 23:36:09 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson closed PDFBOX-2259.
-------------------------------
    Resolution: Not a Problem

I took another look at this PDF. The "marked content" does not include any 
Unicode mappings for the text, and the text embedded in the PDF does not contain 
any Unicode zero-width-space characters. Acrobat, Foxit, and OS X Preview 
produce the same result as PDFBox.

This isn't a bug in PDFBox but a problem with how the original PDF file was 
created. It's not uncommon to find files that display the correct characters 
on screen but have invalid Unicode text assigned to those characters.
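
For anyone who wants to check a file of their own, here is a minimal sketch 
(not part of PDFBox) that scans already-extracted text for zero-width 
separator characters. The class name ZeroWidthScan and its methods are 
hypothetical, and the set of code points checked (U+200B, U+200C, U+200D) is 
an assumption about what a "semi-space" might be encoded as. The input string 
is assumed to come from something like PDFTextStripper.getText().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: lists zero-width separators found in extracted text. */
final class ZeroWidthScan {

    // Assumption: these are the code points most likely to stand in for a "semi-space".
    static boolean isZeroWidth(int cp) {
        return cp == 0x200B   // ZERO WIDTH SPACE
            || cp == 0x200C   // ZERO WIDTH NON-JOINER
            || cp == 0x200D;  // ZERO WIDTH JOINER
    }

    /** Returns one entry per hit, formatted as "index:U+XXXX" (index is a UTF-16 offset). */
    static List<String> findZeroWidth(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isZeroWidth(cp)) {
                hits.add(i + ":" + String.format("U+%04X", cp));
            }
            // Advance by one or two chars depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return hits;
    }
}
```

If findZeroWidth() returns an empty list for text extracted from a page that 
visibly separates compound words, the separator is probably not present in 
the file's Unicode text at all, which matches what was found for test.pdf.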

> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, compound words are separated by a 
> "semi-space" character (see the list of Unicode spaces at 
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document 
> contains such words, PDFTextStripper drops the semi-space character and 
> concatenates the words. 



