[ https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-2984.
-------------------------------------
Resolution: Fixed
Assignee: Tilman Hausherr
Fix Version/s: 2.1.0, 2.0.3
> PDFTextStripper adds extra word/line delimiters when PDF page orientation is
> 180 degrees
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-2984
> URL: https://issues.apache.org/jira/browse/PDFBOX-2984
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.0.2, 2.0.3, 2.1.0
> Environment: Windows/Linux, JDK 1.7
> Reporter: dariusz dusberger
> Assignee: Tilman Hausherr
> Fix For: 2.0.3, 2.1.0
>
> Attachments: 1760_001.pdf, PDFBOX-2984-039354-180°.pdf,
> PDFBOX-2984-072206-180°.pdf, PDFBOX-2984-180°-bad.txt, PDFBOX-2984-180°.pdf,
> PDFStreamEngine.java, diff-to-1.8-rev-1594047.txt
>
>
> PDFTextStripper inserts a word delimiter between every character and a newline
> after each word when the page orientation is 180 degrees.
> This happens because PDFStreamEngine takes the raw scaling factor
> Matrix.getXScale() from the transformation matrix to scale the width and font
> size, which are then used to calculate the spacing between characters.
> =========================================================
> Output of PDFTextStripper.getText(pdDoc):
> T h i s i s
> a t e s t 1 ! ! !
> T h i s
> i s
> a t e s t
> 2
> ! ! !
> T h i s i s
> a
> t e s t 3
> ! ! !
> T h i s i s
> a t e s t 4 ! ! !
> =========================================================
> Example: the following yields a negative spaceWidthDisp and font size in
> PDFTextStripper.
> A 180-degree rotation is [-1, 0, 0; 0, -1, 0; w, h, 1], therefore
> textMatrix.getXScale() == -1, and that factor flips the sign of both
> expressions below:
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText
> * textMatrix.getXScale() * ctm.getXScale()
> fontSizeText * textMatrix.getXScale()
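
The sign problem described in the report can be reproduced without PDFBox. The sketch below is only an illustration under stated assumptions: RotationScaleSketch and all of its methods are hypothetical names, the matrix is a plain float array rather than PDFBox's Matrix class, and the sample values (page size, space width, font size) are made up. It contrasts the raw matrix element, which keeps the sign, with the magnitude of the transformed x unit vector, which does not. It is not presented as the change that resolved this issue.

```java
// Hypothetical, self-contained sketch; it does not use any PDFBox classes.
final class RotationScaleSketch {

    // Affine matrix stored as {a, b, c, d, e, f}, i.e. [a b 0; c d 0; e f 1].
    // A 180-degree page rotation negates both axes and translates by (w, h).
    static float[] rotate180(float pageWidth, float pageHeight) {
        return new float[] {-1f, 0f, 0f, -1f, pageWidth, pageHeight};
    }

    // "Raw" x scale as described in the report: element a, sign included.
    static float rawXScale(float[] m) {
        return m[0];
    }

    // Length of the transformed x unit vector (a, b); never negative.
    static float xScaleMagnitude(float[] m) {
        return (float) Math.hypot(m[0], m[1]);
    }

    // Same product as the spaceWidthDisp expression quoted in the report.
    static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
                                float horizontalScalingText,
                                float textXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args) {
        float[] textMatrix = rotate180(612f, 792f); // assumed page size
        float ctmXScale = 1f;                       // assumed identity CTM

        float raw = spaceWidthDisp(0.25f, 12f, 1f, rawXScale(textMatrix), ctmXScale);
        float abs = spaceWidthDisp(0.25f, 12f, 1f, xScaleMagnitude(textMatrix), ctmXScale);

        System.out.println("space width with raw x scale: " + raw);
        System.out.println("space width with magnitude:   " + abs);
    }
}
```

With these assumed inputs the raw variant gives a negative space width and the magnitude variant a positive one of the same size. If the stripper compares the gap between glyphs against the space width, a negative threshold would presumably make every gap look like a word break, which would be consistent with the output shown above.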