dariusz dusberger created PDFBOX-2984:
-----------------------------------------
Summary: PDFTextStripper adds word/line delimiters when PDF page
orientation is 180 degrees
Key: PDFBOX-2984
URL: https://issues.apache.org/jira/browse/PDFBOX-2984
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.10
Environment: Windows/Linux, JDK 1.7
Reporter: dariusz dusberger
PDFTextStripper inserts a word delimiter between every character and a new
line after each word when the page orientation is 180 degrees.
This happens because PDFStreamEngine uses the raw scaling factor
Matrix.getXScale() of the transformation matrix to scale the width and font
size, which are then used to calculate the spacing between characters.
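To illustrate the sign problem, here is a minimal, self-contained sketch that does not use PDFBox at all. The class RotationScaleSketch and its methods are hypothetical. It models only the upper-left 2x2 part of a transformation matrix and compares the raw x entry with the length of the transformed x unit vector, which carries no sign.

```java
public class RotationScaleSketch {

    /** Returns {a, b, c, d} for a rotation by the given angle (hypothetical helper). */
    public static float[] rotation(double degrees) {
        double r = Math.toRadians(degrees);
        float cos = (float) Math.cos(r);
        float sin = (float) Math.sin(r);
        return new float[] { cos, sin, -sin, cos };
    }

    /** Raw x scale: the first matrix entry, sign included. */
    public static float rawXScale(float[] m) {
        return m[0];
    }

    /** Sign-free x scale: length of the transformed x unit vector. */
    public static float xScaleMagnitude(float[] m) {
        return (float) Math.sqrt(m[0] * m[0] + m[1] * m[1]);
    }

    public static void main(String[] args) {
        float[] upright = rotation(0);
        float[] upsideDown = rotation(180);
        System.out.println("0 deg   raw=" + rawXScale(upright)
                + " magnitude=" + xScaleMagnitude(upright));
        System.out.println("180 deg raw=" + rawXScale(upsideDown)
                + " magnitude=" + xScaleMagnitude(upsideDown));
    }
}
```

For the 180 degree case the raw entry is negative while the magnitude stays positive, so any width or font size multiplied by the raw entry flips sign.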
=========================================================
Output of PDFTextStripper.getText(pdDoc):
T h i s i s
a t e s t 1 ! ! !
T h i s
i s
a t e s t
2
! ! !
T h i s i s
a
t e s t 3
! ! !
T h i s i s
a t e s t 4 ! ! !
=========================================================
Example: the following results in a negative spaceWidthDisp and a negative
font size in PDFTextStripper.
A 180 degree rotation is the matrix [-1, 0, 0; 0, -1, 0; w, h, 1], therefore
textMatrix.getXScale() == -1, and both expressions below become negative:
float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText *
    textMatrix.getXScale() * ctm.getXScale()
fontSizeText * textMatrix.getXScale()
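The arithmetic can be checked with a small standalone sketch. The class SpaceWidthSketch, its method, and the sample values are assumptions for illustration only. The x scale factors are passed in as plain floats instead of being read from PDFBox matrices, and taking the absolute value is just one conceivable guard, not necessarily the fix the project would choose.

```java
public class SpaceWidthSketch {

    /** The product quoted in the report, with both x scale factors passed in. */
    public static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
            float horizontalScalingText, float textMatrixXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textMatrixXScale * ctmXScale;
    }

    public static void main(String[] args) {
        // Assumed sample values, not taken from the report.
        float spaceWidthText = 0.25f;
        float fontSizeText = 12f;
        float horizontalScalingText = 1f;

        // Upright page: text matrix x scale is +1.
        float upright = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, 1f, 1f);
        // Page rotated by 180 degrees: text matrix x scale is -1.
        float rotated = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, -1f, 1f);
        // One possible guard (an assumption): drop the sign before using the width.
        float guarded = Math.abs(rotated);

        System.out.println("upright=" + upright + " rotated=" + rotated
                + " guarded=" + guarded);
    }
}
```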