[
https://issues.apache.org/jira/browse/PDFBOX-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-520.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Works fine; setting this issue to resolved.
> Ignores char spacing (Tc) and word space (Tw) when rendering PDFs to images
> ---------------------------------------------------------------------------
>
> Key: PDFBOX-520
> URL: https://issues.apache.org/jira/browse/PDFBOX-520
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities, Writing
> Affects Versions: 0.8.0-incubator
> Environment: Java
> Reporter: Antony Scerri
> Fix For: 1.0.0
>
> Attachments: PageDrawer.diff, PDFStreamEngine.diff,
> PDFStreamEngine.patch
>
>
> When PDFStreamEngine parses encoded text, the resulting rendering to an image
> does not apply the char spacing (Tc) and word spacing (Tw) through the
> callbacks to processTextPosition. This is because it passes the complete
> string block and an individual character width array. The character and word
> spacings are applied correctly to the matrix calculations, however, so all
> the relevant information is available.
> The problem when writing PDFs to images occurs because text being rendered is
> issued through calls to AWT Font classes, which do not apply the character
> and word spacings across a whole string block. For example, a PDF I have with
> fully justified text has been rendered so that the line is split into
> three blocks (at arbitrary positions in words), each block having a different
> word spacing. When rendered, this comes out as three blocks of text with large
> gaps in between (or even overwriting each other if the character spacing was
> used to compress the rendered string).
> Having read through the PDF spec, it would seem that each glyph should be
> rendered separately, calculating the next one's position taking into account
> the character and word spacing. To achieve this, the attached patch modifies
> the behaviour of PDFStreamEngine to fire a single processTextPosition event
> for each character taken from the encoded string. This may or may not make
> the individual widths array redundant, but it has been preserved
> for backward compatibility, as has the string buffer containing the resulting
> string (which would now only need to contain one character). So both of these
> can be optimized to a length of one and still preserve backward
> compatibility (this patch does not include that change).
> There is also a patch for PageDrawer itself, which I'm not entirely sure
> is necessary. It makes a minor adjustment to how the AffineTransform
> is calculated, removing two lines which previously were altering the text
> position's matrix. I made this change as I couldn't see why this was being
> done, so it is more of an experiment. I couldn't see any difference at first,
> but on closer inspection I see minor changes to some characters' positions
> (when used with the other patch). So together it might give an even closer
> rendering.
> An alternative approach may be to make the page drawer render each character
> from the text position, offsetting each one using the individual widths as it
> goes. However, noting the behaviour of the second patch, that may lead to
> slightly inaccurate renderings again, so it is very hard to tell without
> experimenting and very close examination of the results.
> This may be related to issue PDFBOX-508, which was specifically looking at
> lost space during text extraction.
> I have also looked at text extraction in case this was affecting operations
> like combining diacritics. This was in case the actual positions of
> characters may be off slightly when calculating location using the individual
> character width array as opposed to the glyph's position as it should be
> rendered. As it stores the calculated glyph position, which is subsequently
> summed whilst scanning the string to look for possible overlaps, they should
> result in the same answer. However, this patch means that the last text
> position processed may no longer be the one to check for the overlap. Based
> on a line containing multiple text blocks, this is probably true of the normal
> case anyway, so it probably needs separate attention.
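
For readers following the description above, the per-glyph positioning it refers to can be sketched roughly as follows. This is not the attached patch and does not use any PDFBox API: the class GlyphAdvanceSketch and the method glyphOffsets are hypothetical names chosen for illustration. It assumes a simple single-byte font, glyph widths given in thousandths of a text-space unit, horizontal writing mode, and no TJ adjustments; under those assumptions the PDF text model advances each glyph by (width / 1000 * fontSize + Tc + Tw) * Th, with Tw applied only to the space character.

```java
import java.util.Arrays;

public class GlyphAdvanceSketch {

    /**
     * Returns the text-space x offset at which each glyph of {@code text} starts.
     * widths[i] is the width of the i-th glyph in thousandths of a text-space unit.
     * All names and parameters here are illustrative assumptions, not PDFBox API.
     */
    static double[] glyphOffsets(String text, double[] widths, double fontSize,
                                 double charSpacing, double wordSpacing,
                                 double horizontalScaling) {
        double[] offsets = new double[text.length()];
        double x = 0.0;
        for (int i = 0; i < text.length(); i++) {
            offsets[i] = x;
            // Glyph width scaled to the font size, plus character spacing (Tc).
            double advance = widths[i] / 1000.0 * fontSize + charSpacing;
            // Word spacing (Tw) is added only after a space character.
            if (text.charAt(i) == ' ') {
                advance += wordSpacing;
            }
            // Horizontal scaling (Th) applies to the whole advance.
            x += advance * horizontalScaling;
        }
        return offsets;
    }

    public static void main(String[] args) {
        double[] widths = {500, 250, 500};
        double[] offsets = glyphOffsets("a b", widths, 10.0, 1.0, 2.0, 1.0);
        System.out.println(Arrays.toString(offsets)); // prints [0.0, 6.0, 11.5]
    }
}
```

Drawing a whole string block in one call cannot reproduce these offsets, because the Tc and Tw contributions accumulate glyph by glyph; that is why the report suggests issuing one processTextPosition event per character.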
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.