[ https://issues.apache.org/jira/browse/PDFBOX-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Villu Ruusmann updated PDFBOX-571:
----------------------------------
Attachment: pg_0005.pdf
> Dubious handling of word spacing (Tw)
> -------------------------------------
>
> Key: PDFBOX-571
> URL: https://issues.apache.org/jira/browse/PDFBOX-571
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction, Utilities
> Affects Versions: 0.8.0-incubator
> Reporter: Villu Ruusmann
> Attachments: pg_0005.pdf
>
>
> I wanted to provide a counterexample to the current handling of word spacing.
> The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The
> problem is that this Type1C font uses a custom encoding in which code values
> are assigned sequentially, starting from 1. Thus the code value 32 is
> assigned to the digit "3", not to the space character " " as one would
> expect.
> The PDF producer software has (mis-)used word spacing to break up longer
> character sequences. For example, on table line 3, the character sequence
> "0.831.05" is broken into two cells, "0.83" and "1.05". Other uses of this
> "optimization" can be seen when the sample page is opened in Acrobat Reader
> (tested on version 7.0) and the "Select all" operation is performed. I've
> attached a screenshot of Acrobat Reader (page_0005_selectall.png) for your
> convenience.
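
The mechanism described in the report can be illustrated with a rough sketch. This is not PDFBox code. It assumes the PDF rule that word spacing (Tw) is applied after every single-byte character code 32, whatever glyph that code selects. The class name, the method name, and every entry in the encoding table except 32 -> "3" are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not taken from PDFBox.
class WordSpacingSketch {

    // Assumed fragment of the font's custom encoding. Only 32 -> "3" comes
    // from the report; the other code values are invented for the example.
    static final Map<Integer, String> ENCODING = Map.of(
            11, "0", 12, ".", 13, "1", 14, "5", 31, "8", 32, "3");

    // Decodes a run of single-byte codes and starts a new cell wherever a
    // non-zero word spacing would be applied, i.e. after each code 32.
    static List<String> splitAtWordSpacing(int[] codes, double wordSpacing) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int code : codes) {
            current.append(ENCODING.get(code));
            if (code == 32 && wordSpacing != 0.0) {
                cells.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            cells.add(current.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Codes that decode to "0.831.05" under the assumed encoding.
        int[] codes = {11, 12, 31, 32, 13, 12, 11, 14};
        System.out.println(splitAtWordSpacing(codes, 25.0));
    }
}
```

Under these assumptions the gap falls after the "3" of "0.83", which matches the two table cells seen on the sample page. A text extractor that only looks for a decoded space character would miss this break.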