[
https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15441563#comment-15441563
]
Tilman Hausherr edited comment on PDFBOX-2984 at 8/27/16 3:02 PM:
------------------------------------------------------------------
This patch uses two of the three changes suggested by [~dtd0], but implements
them differently. The last test file refutes my fear ("but what if the negative
number is in the ctm and not in the text matrix"). I'm not using the third
suggested change because it has no effect. I also added a test for 90° and
270° rotated pages, which work as well.
was (Author: tilman):
This patch uses two of the three changes suggested by [~dtd0], but implements
them differently. The last test file refutes my fear ("but what if the negative
number is in the ctm and not in the text matrix"). I'm not using the third
suggested change because it has no effect. However, I'll investigate what
happens with 90° and 270° rotated pages.
> PDFTextStripper adds extra word/line delimiters when PDF page orientation is
> 180 degrees
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-2984
> URL: https://issues.apache.org/jira/browse/PDFBOX-2984
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.10, 1.8.11, 2.0.0
> Environment: Windows/Linux, JDK 1.7
> Reporter: dariusz dusberger
> Attachments: 1760_001.pdf, PDFBOX-2984-039354-180°.pdf,
> PDFBOX-2984-072206-180°.pdf, PDFBOX-2984-180°-bad.txt, PDFBOX-2984-180°.pdf,
> PDFStreamEngine.java, diff-to-1.8-rev-1594047.txt
>
>
> The PDFTextStripper adds a word delimiter between each character and a
> new-line after each word when the page orientation is 180 degrees.
> This happens because the PDFStreamEngine uses the raw scaling factor
> Matrix.getXScale() from the transformation matrix to scale the width and
> font size, which are then used to calculate the spacing between characters.
> =========================================================
> Output of the PDFTextStripper.getText(pdDoc);
> T h i s i s
> a t e s t 1 ! ! !
> T h i s
> i s
> a t e s t
> 2
> ! ! !
> T h i s i s
> a
> t e s t 3
> ! ! !
> T h i s i s
> a t e s t 4 ! ! !
> =========================================================
> Example: The following results in a negative spaceWidthDisp / font size in
> PDFTextStripper.
> 180 degrees = [-1, 0, 0; 0, -1, 0; w, h, 1]; therefore
> textMatrix.getXScale() == -1
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText
>     * textMatrix.getXScale() * ctm.getXScale()
> fontSizeText * textMatrix.getXScale()
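The sign problem in the example above can be reproduced without PDFBox. The following is a minimal standalone sketch, not the code from the patch: the class and method names are hypothetical, the page size, font size and space width are illustrative values, and using the length of the transformed x unit vector as a rotation-independent scale is an assumption about one possible way to avoid the negative result.

```java
// Standalone sketch; it does not use PDFBox. All names here are hypothetical.
final class SpaceWidthSketch {

    // A PDF matrix [a b 0; c d 0; e f 1] is stored here as {a, b, c, d, e, f}.

    /** Raw x-scale: just element a, which is negative for a 180 degree rotation. */
    static float rawXScale(float[] m) {
        return m[0];
    }

    /** Rotation-independent x-scale: length of the transformed unit x vector (a, b). */
    static float xScaleMagnitude(float[] m) {
        return (float) Math.sqrt(m[0] * m[0] + m[1] * m[1]);
    }

    /** Same product as in the issue description, with both scale factors passed in. */
    static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
                                float horizontalScalingText,
                                float textXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args) {
        // Illustrative values only: 612 x 792 page, 12 pt font, space width 0.25.
        float[] identity = {1f, 0f, 0f, 1f, 0f, 0f};
        float[] rot180 = {-1f, 0f, 0f, -1f, 612f, 792f};

        float raw = spaceWidthDisp(0.25f, 12f, 1f,
                rawXScale(rot180), rawXScale(identity));
        float fixed = spaceWidthDisp(0.25f, 12f, 1f,
                xScaleMagnitude(rot180), xScaleMagnitude(identity));

        // Prints: raw = -3.0, magnitude-based = 3.0
        System.out.println("raw = " + raw + ", magnitude-based = " + fixed);
    }
}
```

With the raw factor the space width comes out negative, which is consistent with the spurious delimiters shown in the output above. The magnitude-based factor gives the same positive value for 0°, 90°, 180° and 270° rotations, regardless of whether the negative entry sits in the text matrix or in the ctm.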