[
https://issues.apache.org/jira/browse/PDFBOX-520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783104#action_12783104
]
Villu Ruusmann commented on PDFBOX-520:
---------------------------------------
I have elaborated on Antony's work.
Instead of producing a TextPosition instance for every character, I chose to do
so for arrays of adjacent characters. Two characters are deemed adjacent if
there is no spacing (either arising from char spacing Tc or word spacing Tw)
between them.
Some scenarios:
*) Both Tc and Tw are equal to 0. Only one TextPosition is produced. This is
the most common behaviour (> 80% of cases).
*) Tc is not equal to 0. n TextPositions are produced where n is the number of
characters.
*) Tw is not equal to 0. n + 1 TextPositions are produced where n is the number
of occurrences of the ASCII space character 0x20.
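The three scenarios above can be sketched as a small splitting routine. This is
only an illustration of the adjacency rule, not the code from the patch: the
class AdjacentRuns and the method split are hypothetical names, and the sketch
assumes that char spacing applies after every character and word spacing after
every 0x20 character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups characters into runs that
// have no spacing between them.
class AdjacentRuns {
    static List<String> split(String text, float tc, float tw) {
        List<String> runs = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            // Assumption: a gap follows this character if char spacing is set,
            // or if word spacing is set and the character is a space (0x20).
            boolean gapAfter = tc != 0f || (tw != 0f && c == ' ');
            if (gapAfter) {
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(split("a b c", 0f, 0f)); // one run
        System.out.println(split("a b c", 1f, 0f)); // one run per character
        System.out.println(split("a b c", 0f, 2f)); // a new run after each space
    }
}
```

Note that in this sketch a string ending in a space yields n runs rather than
n + 1, because no empty trailing run is emitted.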
One of the implications of this change is that the width of the TextPosition
(TextPosition#getWidth) now equals the sum of widths of its constituent
characters (TextPosition#getIndividualWidths). Previously it did not, because the
width of each constituent character was off from the correct value (as given by
PDFont#getStringWidth(String)) by the average spacing.
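A minimal sketch of the invariant follows. The class RunWidths and both methods
are hypothetical, and modelling the older situation as "glyph widths plus char
spacing" is an assumption made for illustration, not a description of PDFBox
internals.

```java
// Hypothetical illustration, not PDFBox code.
class RunWidths {
    // Width of a run that contains no spacing: the plain sum of glyph widths.
    static float sum(float[] individualWidths) {
        float total = 0f;
        for (float w : individualWidths) {
            total += w;
        }
        return total;
    }

    // Assumed model of a run that still contains char spacing: the overall
    // displacement exceeds the glyph widths by tc per character.
    static float displacementWithCharSpacing(float[] individualWidths, float tc) {
        return sum(individualWidths) + individualWidths.length * tc;
    }

    public static void main(String[] args) {
        float[] widths = {4f, 2f, 6f};
        System.out.println(sum(widths));
        System.out.println(displacementWithCharSpacing(widths, 0.5f));
    }
}
```

Under this model, spreading the displacement over the characters would put each
individual width off by tc, which is why a run without spacing is the case in
which the two figures agree.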
Some points for further consideration:
*) A single invocation of PDFStreamEngine#processEncodedText(byte[]) can result
in multiple invocations of PDFStreamEngine#processTextPosition(TextPosition).
Should TextPosition manage references to its previous/next neighbour?
*) TextPosition has a redundant property, wordSpacing. Should it not be removed?
> Ignores char spacing (Tc) and word space (Tw) when rendering PDFs to images
> ---------------------------------------------------------------------------
>
> Key: PDFBOX-520
> URL: https://issues.apache.org/jira/browse/PDFBOX-520
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities, Writing
> Affects Versions: 0.8.0-incubator
> Environment: Java
> Reporter: Antony Scerri
> Attachments: PageDrawer.diff, PDFStreamEngine.diff
>
>
> When PDFStreamEngine parses encoded text the resulting rendering to an image
> does not apply the char spacing (Tc) and word spacing (Tw) through the
> callbacks to processTextPosition. This is because it passes the complete
> string block and an individual character width array. The character and word
> spacings are applied correctly to the matrix calculations however and so all
> the relevant information is available.
> The problem when writing PDFs to images occurs because text being rendered is
> issued through calls to AWT Font classes, which do not apply the character
> and word spacings across a whole string block. For example, a PDF I have which
> has the text fully justified has been rendered so that the line is split into
> three blocks (arbitrary positions in words), each block having a different word
> spacing. When rendered this comes out with three blocks of text with large
> gaps in between (or even overwriting each other if the character spacing was
> used to compress the rendered string).
> Having read through the PDF spec, it would seem that each glyph should be
> rendered separately, calculating the next one's position taking into account
> the character and word spacing. To achieve this, the attached patch modifies
> the behaviour of PDFStreamEngine to fire a single processTextPosition event
> for each character taken from the encoded string. This may or may not make
> the individual widths array redundant, but it has been preserved
> for backward compatibility, as has the string buffer containing the resulting
> string (which would now only need to contain one character). So both of these
> can be optimized to a length of one and still preserve backward
> compatibility (this patch does not include that change).
> There is also a patch for PageDrawer itself, which I'm not entirely sure
> is necessary. It makes a minor adjustment to how the AffineTransform
> is calculated, removing two lines which previously were altering the text
> position's matrix. I made this change as I couldn't see why this was being done,
> so it is more of an experiment. I could not see any difference at first, but on closer
> inspection I see minor changes to some characters' positions (when used with
> the other patch). So together they might give an even closer rendering.
> An alternative approach may be to make the page drawer render each character
> from the text position, offsetting each one using the individual widths as it
> goes. However, noting the behaviour of the second patch, that may lead to
> slightly inaccurate renderings again, so it is very hard to tell without
> experimenting and very close examination of the results.
> This may be related to issue PDFBOX-508, which was specifically looking at lost
> space during text extraction.
> I have also looked at text extraction in case this was affecting operations
> like combining diacritics. The concern was that the actual positions of
> characters may be off slightly when calculating location using the individual
> character width array as opposed to the glyph's position as it should be
> rendered. As it stores the calculated glyph position, which is subsequently
> summed whilst scanning the string to look for possible overlaps, they should
> result in the same answer. However, this patch means that the last text position
> processed may no longer be the one to check for the overlap. Based on a
> line containing multiple text blocks this is probably true of the normal case
> anyway, so it probably needs separate attention.
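The per-glyph positioning described in the issue can be sketched as follows.
The PDF specification gives the horizontal displacement after a glyph as
tx = (w0 * Tfs + Tc + Tw) * Th, where Tw applies only to the single-byte
character code 32 (the TJ adjustment term is left out here). The class
GlyphAdvance and the method origins are hypothetical names, and the widths are
assumed to be already divided by 1000.

```java
// Hypothetical illustration, not PDFBox code.
class GlyphAdvance {
    // Returns the x origin of each glyph in unscaled text space.
    // w0[i] is the width of glyph i in text space (glyph width / 1000),
    // fontSize is Tfs, tc is Tc, tw is Tw and th is the horizontal scaling Th.
    static float[] origins(String text, float[] w0, float fontSize,
                           float tc, float tw, float th) {
        float[] xs = new float[text.length()];
        float x = 0f;
        for (int i = 0; i < text.length(); i++) {
            xs[i] = x;
            // Word spacing is only added after a space character (0x20).
            float wordSpacing = text.charAt(i) == ' ' ? tw : 0f;
            x += (w0[i] * fontSize + tc + wordSpacing) * th;
        }
        return xs;
    }

    public static void main(String[] args) {
        float[] xs = origins("a b", new float[] {0.5f, 0.25f, 0.5f}, 8f, 0.5f, 1f, 1f);
        for (float x : xs) {
            System.out.println(x);
        }
    }
}
```

A renderer that draws a whole string through one font call has no place to add
Tc and Tw between glyphs, which is the gap the one-event-per-character patch
works around.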