[ 
https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15441563#comment-15441563
 ] 

Tilman Hausherr commented on PDFBOX-2984:
-----------------------------------------

This patch uses two of the three changes suggested by [~dtd0], but implements 
them in a different way. The last test file refutes my fear ("but what if the 
negative number is in the ctm and not in the text matrix"). I'm not using the 
third suggested change because it has no effect. However, I'll still 
investigate what happens with pages rotated by 90° and 270°.

> PDFTextStripper adds extra word/line delimiters when PDF page orientation is 
> 180 degrees
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2984
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2984
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>         Environment: Windows/Linux, JDK 1.7
>            Reporter: dariusz dusberger
>         Attachments: 1760_001.pdf, PDFBOX-2984-039354-180°.pdf, 
> PDFBOX-2984-072206-180°.pdf, PDFBOX-2984-180°-bad.txt, PDFBOX-2984-180°.pdf, 
> PDFStreamEngine.java, diff-to-1.8-rev-1594047.txt
>
>
> The PDFTextStripper adds a word delimiter between every character and a 
> newline after each word when the page orientation is 180 degrees. 
> This happens because the PDFStreamEngine uses the raw scaling factor 
> Matrix.getXScale() from the transformation matrix to scale the width and 
> font size, which are then used to calculate the spacing between characters.
> =========================================================
> Output of the PDFTextStripper.getText(pdDoc);
> T h i s  i s  
> a  t e s t  1  ! ! !
> T h i s  
> i s  
> a  t e s t  
> 2  
> ! ! !
> T h i s  i s  
> a  
> t e s t  3  
> ! ! !
> T h i s  i s  
> a  t e s t  4 ! ! !
> =========================================================
> Example: The following will result in a negative spaceWidthDisp / font size 
> in PDFTextStripper.
> 180 degrees = [-1, 0, 0; 0, -1, 0; w, h, 1]; therefore 
> textMatrix.getXScale() == -1
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
> * textMatrix.getXScale() * ctm.getXScale()
> fontSizeText * textMatrix.getXScale()
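
The arithmetic in the example above can be reproduced without PDFBox. The 
following is only a minimal sketch: the class and method names are 
hypothetical, the input values are made up for illustration, and taking the 
magnitude of the product is an assumed workaround used here to show the sign 
problem, not the change made in the actual patch.

```java
// Hypothetical, self-contained illustration; not PDFBox code.
final class SpaceWidthSketch {

    // Raw product as written in the issue description. The sign of each
    // scale factor carries through to the result.
    static float rawSpaceWidth(float spaceWidthText, float fontSizeText,
                               float horizontalScalingText,
                               float textMatrixXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textMatrixXScale * ctmXScale;
    }

    // Assumed workaround for illustration only: use the magnitude so that
    // a 180 degree rotation cannot produce a negative width.
    static float magnitudeSpaceWidth(float spaceWidthText, float fontSizeText,
                                     float horizontalScalingText,
                                     float textMatrixXScale, float ctmXScale) {
        return Math.abs(rawSpaceWidth(spaceWidthText, fontSizeText,
                horizontalScalingText, textMatrixXScale, ctmXScale));
    }

    public static void main(String[] args) {
        float spaceWidthText = 0.25f;      // made-up space width in text space
        float fontSizeText = 12f;          // made-up font size
        float horizontalScalingText = 1f;  // 100% horizontal scaling
        float textMatrixXScale = -1f;      // x scale of a 180 degree rotation
        float ctmXScale = 1f;              // unrotated CTM

        System.out.println(rawSpaceWidth(spaceWidthText, fontSizeText,
                horizontalScalingText, textMatrixXScale, ctmXScale));       // -3.0
        System.out.println(magnitudeSpaceWidth(spaceWidthText, fontSizeText,
                horizontalScalingText, textMatrixXScale, ctmXScale));       // 3.0
    }
}
```

With a negative width, any comparison of the gap between two characters 
against the expected space width is bound to misfire, which is consistent 
with the extra delimiters shown in the output above.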


