Ignores char spacing (Tc) and word space (Tw) when rendering PDFs to images
---------------------------------------------------------------------------

                 Key: PDFBOX-520
                 URL: https://issues.apache.org/jira/browse/PDFBOX-520
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities, Writing
    Affects Versions: 0.8.0-incubator
         Environment: Java
            Reporter: Antony Scerri
         Attachments: PageDrawer.diff, PDFStreamEngine.diff

When PDFStreamEngine parses encoded text, the resulting rendering to an image
does not apply the char spacing (Tc) and word spacing (Tw) through the
callbacks to processTextPosition. This is because it passes the complete string
block and an individual character width array. The character and word spacings
are, however, applied correctly in the matrix calculations, so all the
relevant information is available.

The problem when writing PDFs to images occurs because the text is rendered
through calls to AWT Font classes, which do not apply the character and word
spacings across a whole string block. For example, I have a PDF with fully
justified text in which one line is split into three blocks (at arbitrary
positions within words), each block having a different word spacing. When
rendered, this comes out as three blocks of text with large gaps in between
(or even overwriting each other if the character spacing was used to compress
the rendered string).

Having read through the PDF spec, it would seem that each glyph should be
rendered separately, calculating the next one's position taking into account
the character and word spacing. To achieve this, the attached patch modifies
the behaviour of PDFStreamEngine to fire a single processTextPosition event for
each character taken from the encoded string. This may or may not make the
individual widths array redundant, but it has been preserved for backward
compatibility, as has the string buffer containing the resulting string (which
would now only need to contain one character). Both of these could be optimized
to a length of one and still preserve backward compatibility (this patch does
not include that change).
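
To illustrate the per-glyph positioning described above, here is a minimal
sketch. It is not the attached patch and does not use PDFBox classes; the class
and method names are hypothetical. It assumes a simple font with single-byte
codes, no TJ adjustments, and horizontal writing, and follows the displacement
rule from the PDF spec: advance = (w0 * Tfs + Tc + Tw) * Th, where Tw only
applies to character code 32.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: positions each glyph of one text block. */
class GlyphAdvanceSketch {

    /**
     * Returns the x offset (in text space units) at which each glyph starts.
     *
     * @param codes       single-byte character codes of the shown string
     * @param glyphWidths glyph widths in text space (font width / 1000)
     * @param fontSize    Tfs
     * @param charSpacing Tc, added after every glyph
     * @param wordSpacing Tw, added only after character code 32
     * @param horizScale  Th as a fraction (Tz / 100)
     */
    static List<Double> glyphOffsets(int[] codes, double[] glyphWidths,
                                     double fontSize, double charSpacing,
                                     double wordSpacing, double horizScale) {
        if (codes.length != glyphWidths.length) {
            throw new IllegalArgumentException("one width is required per code");
        }
        List<Double> offsets = new ArrayList<>();
        double x = 0.0;
        for (int i = 0; i < codes.length; i++) {
            // One entry per glyph, analogous to one text position event per character.
            offsets.add(x);
            double advance = glyphWidths[i] * fontSize + charSpacing;
            if (codes[i] == 32) {
                advance += wordSpacing;
            }
            x += advance * horizScale;
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] codes = {'a', ' ', 'b'};
        double[] widths = {0.5, 0.25, 0.5};
        // Tfs = 10, Tc = 1, Tw = 2, Th = 1
        System.out.println(glyphOffsets(codes, widths, 10.0, 1.0, 2.0, 1.0));
        // prints [0.0, 6.0, 11.5]
    }
}
```

Drawing the same three characters as one string would place "b" at 7.5 rather
than 11.5, which is the kind of drift that shows up as gaps or overlaps between
blocks.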

There is also a patch for PageDrawer itself, which I'm not entirely sure is
necessary. It makes a minor adjustment to how the AffineTransform is
calculated, removing two lines which previously altered the text position's
matrix. I made this change because I couldn't see why this was being done, so
it is more of an experiment. I could not see any difference at first, but on
closer inspection I see minor changes to some characters' positions (when used
with the other patch). So together they might give an even closer rendering.

An alternative approach may be to make the page drawer render each character
from the text position, offsetting each one using the individual widths as it
goes. However, noting the behaviour of the second patch, that may again lead to
slightly inaccurate renderings, which are very hard to detect without
experimenting and very close examination of the results.
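
A rough sketch of that alternative, for comparison. This is not PDFBox's
PageDrawer; the class and method names are hypothetical, and it assumes that
each entry of the individual widths array is the full advance of that character
in the drawing coordinate system and that one Java char corresponds to one
glyph.

```java
import java.awt.Graphics2D;

/** Hypothetical drawer helper, not PDFBox's PageDrawer. */
class PerCharacterDrawSketch {

    /** Start x of each character: the block's x plus the widths of all preceding characters. */
    static float[] characterStarts(float blockX, float[] individualWidths) {
        float[] starts = new float[individualWidths.length];
        float x = blockX;
        for (int i = 0; i < individualWidths.length; i++) {
            starts[i] = x;
            x += individualWidths[i];
        }
        return starts;
    }

    /** Draws one character at a time instead of handing the whole string to AWT. */
    static void drawBlock(Graphics2D g, String text, float blockX, float baselineY,
                          float[] individualWidths) {
        if (text.length() != individualWidths.length) {
            throw new IllegalArgumentException("one width is required per character");
        }
        float[] starts = characterStarts(blockX, individualWidths);
        for (int i = 0; i < text.length(); i++) {
            g.drawString(String.valueOf(text.charAt(i)), starts[i], baselineY);
        }
    }
}
```

Because the offsets are accumulated sums of the widths rather than positions
taken from the text matrix, small differences can build up along a block, which
may be the source of the slight inaccuracies mentioned above.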

This may be related to issue PDFBOX-508, which was specifically looking at lost
space during text extraction.

I have also looked at text extraction in case this was affecting operations
like combining diacritics. The concern was that the actual positions of
characters may be slightly off when the location is calculated using the
individual character width array as opposed to the glyph's position as it
should be rendered. As it stores the calculated glyph position, which is
subsequently summed whilst scanning the string to look for possible overlaps,
both should result in the same answer. However, this patch means that the last
text position processed may no longer be the one to check for the overlap.
Since a line can contain multiple text blocks, this is probably true of the
normal case anyway, so it probably needs separate attention.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
