[ 
https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2984:
------------------------------------
    Attachment: diff-to-1.8-rev-1594047.txt

I've attached a diff against the revision you were working with. It isn't a 
proper patch in the form we require, but it shows your changes, which go 
beyond the one you described.

Your change is probably going in the right direction, but what if the negative 
scale is in the ctm rather than in the text matrix? And would the change 
affect RTL text?

So I'm not doing anything now, but I have added your file to my test set. We'll 
probably have a closer look at the text extraction issues after releasing 2.0.

> PDFTextStripper adds extra word/line delimiters when PDF page orientation is 
> 180 degrees
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2984
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2984
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>         Environment: Windows/Linux, JDK 1.7
>            Reporter: dariusz dusberger
>         Attachments: 1760_001.pdf, PDFStreamEngine.java, 
> diff-to-1.8-rev-1594047.txt
>
>
> The PDFTextStripper adds a word delimiter between each character and a new 
> line after each word when the page orientation is 180 degrees. 
> This happens because the PDFStreamEngine uses the raw scaling factor 
> Matrix.getXScale() from the transformation matrix to scale the width and 
> font size, which are then used to calculate the spacing between characters.
> =========================================================
> Output of the PDFTextStripper.getText(pdDoc);
> T h i s  i s  
> a  t e s t  1  ! ! !
> T h i s  
> i s  
> a  t e s t  
> 2  
> ! ! !
> T h i s  i s  
> a  
> t e s t  3  
> ! ! !
> T h i s  i s  
> a  t e s t  4 ! ! !
> =========================================================
> Example: the following results in a negative spaceWidthDisp and font size in 
> PDFTextStripper.
> 180 degrees = [-1, 0, 0; 0, -1, 0; w, h, 1]; therefore 
> textMatrix.getXScale() == -1
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
> * textMatrix.getXScale() * ctm.getXScale()
> font size = fontSizeText * textMatrix.getXScale()
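
To make the sign problem above concrete, here is a minimal stand-alone Java 
sketch. It does not use PDFBox: the class ScalingSketch and its methods are 
hypothetical names invented for illustration, and the input values (space 
width 0.25, font size 12) are made up. It compares the signed matrix entry, 
which is what the description says getXScale() yields for a 180 degree 
rotation, with the length of the transformed x unit vector, which stays 
positive whether the rotation sits in the text matrix or in the ctm. It is 
not the change in the attached file, and it does not address RTL text.

```java
// Hypothetical stand-alone sketch; none of these names come from PDFBox.
final class ScalingSketch {

    /** Signed x entry of the matrix [a, b, 0; c, d, 0; e, f, 1]. */
    static float rawXScale(float a, float b) {
        return a;
    }

    /** Length of the transformed x unit vector, sqrt(a^2 + b^2); never negative. */
    static float xScalingFactor(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    /** Same product as the spaceWidthDisp formula in the issue description. */
    static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
                                float horizontalScalingText,
                                float textXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args) {
        // 180 degree rotation: a = cos(180) = -1, b = sin(180) = 0.
        float rotA = -1f, rotB = 0f;
        // Identity: a = 1, b = 0.
        float idA = 1f, idB = 0f;

        // Rotation in the text matrix, signed entries: negative width.
        System.out.println(spaceWidthDisp(0.25f, 12f, 1f,
                rawXScale(rotA, rotB), rawXScale(idA, idB)));

        // Rotation in the ctm instead, signed entries: still negative.
        System.out.println(spaceWidthDisp(0.25f, 12f, 1f,
                rawXScale(idA, idB), rawXScale(rotA, rotB)));

        // Vector lengths instead of signed entries: positive in both cases.
        System.out.println(spaceWidthDisp(0.25f, 12f, 1f,
                xScalingFactor(rotA, rotB), xScalingFactor(idA, idB)));
        System.out.println(spaceWidthDisp(0.25f, 12f, 1f,
                xScalingFactor(idA, idB), xScalingFactor(rotA, rotB)));
    }
}
```

Taking the magnitude of each factor separately, rather than of the text 
matrix alone, is what keeps the result positive when the negative entry is 
in the ctm.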



