[ https://issues.apache.org/jira/browse/PDFBOX-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Villu Ruusmann updated PDFBOX-571:
----------------------------------

    Attachment: pg_0005.pdf

> Dubious handling of word spacing (Tw)
> -------------------------------------
>
>                 Key: PDFBOX-571
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-571
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf
>
>
> I wanted to provide a counterexample to the current handling of word spacing.
> The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The
> problem is that this Type1C font uses a custom encoding in which code values
> are assigned sequentially, starting from 1. Thus the code value 32 is
> assigned to the digit "3", not to the space character " " as one would
> expect.
> The PDF producer software has (mis-)used word spacing to break up longer
> character sequences. For example, on table line 3, the character sequence
> "0.831.05" is split into two cells, "0.83" and "1.05". Other uses of this
> "optimization" can be seen by opening the sample page in Acrobat Reader
> (tested with version 7.0) and performing the "Select all" operation. I've
> attached a screenshot of Acrobat Reader (page_0005_selectall.png) for your
> convenience.
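
The case above can be illustrated with a minimal sketch in Java. It is not PDFBox code. It relies on one fact from the PDF specification: word spacing (Tw) is added after every occurrence of the single-byte character code 32, whichever glyph that code maps to. Everything else is an assumption made for illustration: the class and method names, the gap threshold, and all code-to-glyph assignments except 32 -> "3", which is the one reported in this issue.

```java
import java.util.Map;

class WordSpacingSketch {
    // Hypothetical heuristic: a Tw gap wider than this fraction of the
    // font size is treated as a visual break between two words/cells.
    static final double GAP_THRESHOLD = 0.3;

    // Decodes the character codes through the font's encoding and emits a
    // space wherever word spacing opens a visible gap after code 32.
    static String extract(int[] codes, Map<Integer, String> encoding,
                          double fontSize, double wordSpacing) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            out.append(encoding.get(codes[i]));
            // Tw is keyed on the code value 32, not on the decoded glyph.
            boolean gapFollows = codes[i] == 32
                    && wordSpacing > GAP_THRESHOLD * fontSize;
            if (gapFollows && i < codes.length - 1) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    // Illustrative custom encoding: only 32 -> "3" comes from the report,
    // the remaining code values are made up.
    static Map<Integer, String> sampleEncoding() {
        return Map.of(29, "0", 14, ".", 37, "8", 32, "3", 30, "1", 34, "5");
    }

    // Character codes that decode to "0.831.05" under sampleEncoding().
    static int[] sampleCodes() {
        return new int[] {29, 14, 37, 32, 30, 14, 29, 34};
    }

    public static void main(String[] args) {
        // With a large Tw the run is split after the "3" (code 32).
        System.out.println(extract(sampleCodes(), sampleEncoding(), 10.0, 5.0));
        // prints "0.83 1.05"
    }
}
```

With Tw set to 0 the same codes decode to the unbroken "0.831.05". The point of the sketch is that the break depends on the code value, so an extractor that only looks for a decoded " " character would miss it.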

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.