Dubious handling of word spacing (Tw)
-------------------------------------

                 Key: PDFBOX-571
                 URL: https://issues.apache.org/jira/browse/PDFBOX-571
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction, Utilities
    Affects Versions: 0.8.0-incubator
            Reporter: Villu Ruusmann
         Attachments: pg_0005.pdf

I wanted to provide a counterexample to the current handling of word spacing.

The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The
problem is that this Type1C font uses a custom encoding whose code values
are assigned sequentially, starting from 1. As a result, code value 32 is
assigned to the digit "3", not to the space character " " as one would
expect.

The PDF producer software has (mis-)used word spacing to break up longer 
character sequences. For example, on table line 3, the character sequence 
"0.831.05" is broken into two cells "0.83" and "1.05". Other uses of this 
"optimization" can be seen when the sample page is opened in Acrobat Reader 
(tested on version 7.0) and the "Select all" operation is performed. I've
attached a screenshot of Acrobat Reader (page_0005_selectall.png) for your
convenience.
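
To make the effect concrete, here is a minimal sketch of how a glyph's
horizontal advance is computed for a simple font according to the PDF
specification's text-positioning formula, tx = (w0 * Tfs + Tc + Tw) * Th,
where Tw is added for every single-byte code 32 regardless of which glyph
that code selects. This is not PDFBox code. The class name, the method
name, the glyph width of 500, the Tw value of 20 and every code except
"3" -> 32 are assumptions made up for illustration.

```java
public class WordSpacingSketch {

    /**
     * Horizontal displacement of one glyph in a simple font, ignoring TJ
     * adjustments: tx = (w0 * Tfs + Tc + Tw) * Th.
     * glyphWidth is in 1/1000 text space units; horizScale is Tz / 100.
     */
    static double advance(int code, double glyphWidth, double fontSize,
                          double charSpacing, double wordSpacing,
                          double horizScale) {
        // Tw is keyed on the byte value 32, not on the glyph it selects.
        double tw = (code == 32) ? wordSpacing : 0.0;
        return (glyphWidth / 1000.0 * fontSize + charSpacing + tw) * horizScale;
    }

    public static void main(String[] args) {
        // Hypothetical codes for "0.831.05"; only '3' -> 32 comes from the report.
        String text = "0.831.05";
        int[] codes = {29, 14, 37, 32, 30, 14, 29, 34};
        double x = 0.0;
        for (int i = 0; i < codes.length; i++) {
            // Assumed values: width 500, font size 10, Tc 0, Tw 20, Th 1.0.
            double adv = advance(codes[i], 500.0, 10.0, 0.0, 20.0, 1.0);
            System.out.printf("%c at x=%.1f, advance=%.1f%n",
                    text.charAt(i), x, adv);
            x += adv;
        }
    }
}
```

With these assumed values, only the "3" receives the extra advance, which
is what opens the gap between "0.83" and "1.05". An extractor that decides
whether to apply Tw by looking at the decoded character rather than at the
code value would miss this gap.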
