dariusz dusberger created PDFBOX-2984:
-----------------------------------------

             Summary: PDFTextStripper adds word/line delimiters when PDF page 
orientation is 180 degrees
                 Key: PDFBOX-2984
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2984
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.10
         Environment: Windows/Linux, JDK 1.7
            Reporter: dariusz dusberger


PDFTextStripper inserts a word delimiter between every character and a new-line 
after each word when the page orientation is 180 degrees.

This happens because PDFStreamEngine takes the raw scaling factor 
Matrix.getXScale() from the transformation matrix and uses it to scale the 
width and font size, which are then used to calculate the spacing between 
characters. For a 180-degree rotation that factor is negative.

=========================================================
Output of PDFTextStripper.getText(pdDoc):

T h i s  i s  
a  t e s t  1  ! ! !
T h i s  
i s  
a  t e s t  
2  
! ! !
T h i s  i s  
a  
t e s t  3  
! ! !
T h i s  i s  
a  t e s t  4 ! ! !

=========================================================
Example: the following results in a negative spaceWidthDisp and a negative 
font size in PDFTextStripper.

A 180-degree rotation is the matrix [-1, 0, 0; 0, -1, 0; w, h, 1], therefore 
textMatrix.getXScale() == -1

float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText * 
textMatrix.getXScale() * ctm.getXScale()

The font size is scaled by the same factor and also becomes negative:

fontSizeText * textMatrix.getXScale()
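
To illustrate the sign problem without depending on PDFBox itself, here is a minimal standalone sketch. The class RotationScaleSketch and its methods are hypothetical names, not PDFBox API, and the sample values (a space width of 0.278 and a font size of 12) are assumptions chosen only for the demonstration. It compares the raw matrix entry with the length of the transformed unit x vector, which stays positive under rotation. This shows one possible way to avoid the negative value, not necessarily the fix that PDFBox should adopt.

```java
public class RotationScaleSketch {

    /** Raw x entry of the matrix, i.e. the value the report says getXScale() returns. */
    public static float rawXScale(float a, float b) {
        return a;
    }

    /** Length of the transformed unit x vector (a, b); non-negative for any rotation. */
    public static float xScalingFactor(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    /** Same product as in the report, with the two x-scale factors passed in. */
    public static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
                                       float horizontalScalingText,
                                       float textXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args) {
        // First row of a 180-degree rotation matrix: a = -1, b = 0.
        float a = -1f;
        float b = 0f;

        // Assumed sample values, for illustration only.
        float spaceWidthText = 0.278f;
        float fontSizeText = 12f;
        float horizontalScalingText = 1f;
        float ctmXScale = 1f;

        float withRaw = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, rawXScale(a, b), ctmXScale);
        float withMagnitude = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, xScalingFactor(a, b), ctmXScale);

        System.out.println("raw x scale:       " + withRaw);        // negative
        System.out.println("magnitude x scale: " + withMagnitude);  // positive
    }
}
```

With the raw entry the computed space width is negative, which matches the symptom described above; with the magnitude it has the same absolute value but a positive sign.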





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
