[ 
https://issues.apache.org/jira/browse/PDFBOX-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-520.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Works fine; setting this issue to resolved.

> Ignores char spacing (Tc) and word space (Tw) when rendering PDFs to images
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-520
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-520
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities, Writing
>    Affects Versions: 0.8.0-incubator
>         Environment: Java
>            Reporter: Antony Scerri
>             Fix For: 1.0.0
>
>         Attachments: PageDrawer.diff, PDFStreamEngine.diff, 
> PDFStreamEngine.patch
>
>
> When PDFStreamEngine parses encoded text, the resulting rendering to an image 
> does not apply the char spacing (Tc) and word spacing (Tw) through the 
> callbacks to processTextPosition. This is because it passes the complete 
> string block and an individual character width array. However, the character 
> and word spacings are applied correctly to the matrix calculations, so all 
> the relevant information is available. 
> The problem when writing PDFs to images occurs because text being rendered is 
> issued through calls to AWT Font classes, which do not apply the character 
> and word spacings across a whole string block. For example, a PDF I have which 
> has the text fully justified has been rendered so that the line is split into 
> three blocks (at arbitrary positions in words), each block having a different 
> word spacing. When rendered, this comes out as three blocks of text with large 
> gaps in between (or even overwriting each other if the character spacing was 
> used to compress the rendered string). 
> Having read through the PDF spec, it would seem that each glyph should be 
> rendered separately, calculating the next one's position taking into account 
> the character and word spacing. To achieve this, the attached patch modifies 
> the behaviour of PDFStreamEngine to fire a single processTextPosition event 
> for each character taken from the encoded string. This may or may not make 
> the individual widths array redundant, but it has been preserved 
> for backward compatibility, as has the string buffer containing the resulting 
> string (which would now only need to contain one character). So both of these 
> can be optimized to a length of one and still preserve backward 
> compatibility (this patch does not include that change).
> There is also a patch for PageDrawer itself, which I'm not entirely sure is 
> necessary. It makes a minor adjustment to how the AffineTransform 
> is calculated, removing two lines which previously were altering the text 
> position's matrix. I made this change as I couldn't see why this was being 
> done, so it is more of an experiment. I could not see any difference at first, 
> but on closer inspection I see minor changes to some characters' positions 
> (when used with the other patch). So together they might give an even closer 
> rendering.
> An alternative approach may be to make the page drawer render each character 
> from the text position, offsetting each one using the individual widths as it 
> goes. However, noting the behaviour of the second patch, that may lead to 
> slightly inaccurate renderings again, so it is very hard to tell without 
> experimenting and very close examination of the results.
> This may be related to issue PDFBOX-508, which was specifically looking at 
> lost space during text extraction.
> I have also looked at text extraction in case this was affecting operations 
> like combining diacritics. This was in case the actual positions of 
> characters may be off slightly when calculating location using the individual 
> character width array as opposed to the glyph's position as it should be 
> rendered. As it stores the calculated glyph position, which is subsequently 
> summed whilst scanning the string to look for possible overlaps, they should 
> result in the same answer. However, this patch means that the last text 
> position processed may no longer be the one to check for the overlap in. Based 
> on a line containing multiple text blocks, this is probably true of the normal 
> case anyway, so it probably needs separate attention.
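
For illustration only, the per-glyph positioning described in the report can be sketched as below. This is not the attached patch and not PDFBox code: the class name, method name and parameters are hypothetical. It assumes the horizontal advance rule from the PDF specification, advance = (w0 * Tfs + Tc + Tw) * Th, with Tw applied only to the space character, and it leaves out TJ position adjustments and vertical writing.

```java
/**
 * Hypothetical sketch: where each glyph of a string starts when character
 * spacing (Tc) and word spacing (Tw) are applied glyph by glyph.
 * Not taken from PDFBox; all names here are illustrative assumptions.
 */
final class GlyphAdvanceSketch {

    private GlyphAdvanceSketch() {
    }

    /**
     * @param text                the decoded string, one char per glyph (assumption)
     * @param widths1000          glyph widths in 1/1000 text space units, one per char
     * @param fontSize            Tfs, the text font size
     * @param charSpacing         Tc, added after every glyph
     * @param wordSpacing         Tw, added after each space character only
     * @param horizScalingPercent Tz value, e.g. 100 for no scaling
     * @return the x offset in text space at which each glyph starts
     */
    static double[] glyphOrigins(String text, double[] widths1000, double fontSize,
                                 double charSpacing, double wordSpacing,
                                 double horizScalingPercent) {
        if (text.length() != widths1000.length) {
            throw new IllegalArgumentException("one width per character is required");
        }
        double th = horizScalingPercent / 100.0;
        double[] origins = new double[text.length()];
        double x = 0.0;
        for (int i = 0; i < text.length(); i++) {
            // Record the start of this glyph before advancing.
            origins[i] = x;
            double advance = widths1000[i] / 1000.0 * fontSize + charSpacing;
            if (text.charAt(i) == ' ') {
                // Word spacing applies to the space character only.
                advance += wordSpacing;
            }
            x += advance * th;
        }
        return origins;
    }
}
```

Drawing each glyph at its own origin, rather than handing the whole string to the font renderer in one call, is what lets Tc and Tw take effect inside a text block.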

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
