[ 
https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15441447#comment-15441447
 ] 

Tilman Hausherr edited comment on PDFBOX-2984 at 8/27/16 1:39 PM:
------------------------------------------------------------------

PDFBOX-2984-039354-180°.pdf and PDFBOX-2984-072206-180°.pdf are real-world 
files from the GovDocs site. I created PDFBOX-2984-180°.pdf myself with the 
code below, to confirm or refute my fear from a year ago ("but what if the 
negative number is in the ctm and not in the text matrix"). The first page 
applies the 180° rotation through the text matrix, the second through the ctm.
{code}
        // Needs PDDocument, PDPage, PDPageContentStream, PDRectangle,
        // PDType1Font and Matrix from PDFBox 2.0.
        try (PDDocument document = new PDDocument())
        {
            // Page 1: the 180° rotation is applied through the text matrix.
            PDPage page = new PDPage(PDRectangle.A4);
            page.setRotation(180);
            document.addPage(page);

            try (PDPageContentStream cs = new PDPageContentStream(document, page))
            {
                cs.beginText();
                cs.setLeading(50);
                cs.setFont(PDType1Font.HELVETICA, 50);
                cs.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(180), 500, 100));
                cs.showText("Apache PDFBox ®");
                cs.setFont(PDType1Font.HELVETICA, 12);
                cs.newLine();
                cs.showText("180° text matrix");
                cs.endText();
            }

            // Page 2: the same rotation is applied through the ctm instead,
            // before the text object begins.
            page = new PDPage(PDRectangle.A4);
            page.setRotation(180);
            document.addPage(page);

            try (PDPageContentStream cs = new PDPageContentStream(document, page))
            {
                cs.transform(Matrix.getRotateInstance(Math.toRadians(180), 500, 100));
                cs.beginText();
                cs.setLeading(50);
                cs.setFont(PDType1Font.HELVETICA, 50);
                cs.showText("Apache PDFBox ®");
                cs.setFont(PDType1Font.HELVETICA, 12);
                cs.newLine();
                cs.showText("180° ctm");
                cs.endText();
            }

            document.save("PDFBOX-2984-180°.pdf");
        }
{code}
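
To judge whether text extracted from such a test file shows the symptom reported in this issue (a delimiter between every character), a rough heuristic is to count how many whitespace-separated tokens are a single character. This is only a sketch using plain Java; the class name, method names and the 0.5 threshold are my own assumptions and not part of PDFBox. The input string would be whatever PDFTextStripper.getText() returned.

```java
final class SpacingSymptomCheck
{
    // Fraction of whitespace-separated tokens that consist of exactly one
    // code point. Returns 0.0 for empty or whitespace-only input.
    static double singleCharTokenRatio(String text)
    {
        String trimmed = text.trim();
        if (trimmed.isEmpty())
        {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int single = 0;
        for (String token : tokens)
        {
            if (token.codePointCount(0, token.length()) == 1)
            {
                single++;
            }
        }
        return (double) single / tokens.length;
    }

    // Hypothetical helper: treats the text as over-split when more than
    // 'threshold' of its tokens are single characters.
    static boolean looksOverSplit(String text, double threshold)
    {
        return singleCharTokenRatio(text) > threshold;
    }

    public static void main(String[] args)
    {
        // First sample is taken from the output quoted in the issue
        // description; the second is what one would hope to get instead.
        String bad = "T h i s  i s  \na  t e s t  1  ! ! !";
        String good = "This is a test 1 !!!";
        System.out.println("bad:  " + looksOverSplit(bad, 0.5));
        System.out.println("good: " + looksOverSplit(good, 0.5));
    }
}
```

The threshold is arbitrary; very short texts or texts that are mostly single digits would need a different check.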




> PDFTextStripper adds extra word/line delimiters when PDF page orientation is 
> 180 degrees
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2984
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2984
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0
>         Environment: Windows/Linux, JDK 1.7
>            Reporter: dariusz dusberger
>         Attachments: 1760_001.pdf, PDFBOX-2984-039354-180°.pdf, 
> PDFBOX-2984-072206-180°.pdf, PDFBOX-2984-180°-bad.txt, PDFBOX-2984-180°.pdf, 
> PDFStreamEngine.java, diff-to-1.8-rev-1594047.txt
>
>
> The PDFTextStripper adds a word delimiter between each character and a 
> newline after each word when the page orientation is 180 degrees. 
> This happens because PDFStreamEngine uses the raw scaling factor 
> Matrix.getXScale() from the transformation matrix to scale the width and 
> font size, which are then used to calculate the spacing between characters.
> =========================================================
> Output of the PDFTextStripper.getText(pdDoc);
> T h i s  i s  
> a  t e s t  1  ! ! !
> T h i s  
> i s  
> a  t e s t  
> 2  
> ! ! !
> T h i s  i s  
> a  
> t e s t  3  
> ! ! !
> T h i s  i s  
> a  t e s t  4 ! ! !
> =========================================================
> Example: The following results in a negative spaceWidthDisp and a negative 
> font size in PDFTextStripper.
> 180 degrees = [-1, 0, 0; 0, -1, 0; w, h, 1]; therefore 
> textMatrix.getXScale() == -1
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText 
> * textMatrix.getXScale() * ctm.getXScale()
> font size: fontSizeText * textMatrix.getXScale()
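
The sign problem in the formula quoted above can be reproduced without PDFBox. The sketch below uses java.awt.geom.AffineTransform as a stand-in for the PDFBox Matrix class (an assumption made only for illustration) and compares the raw x-scale entry of a 180° rotation with the length of the transformed x unit vector, which does not depend on the rotation. The class name, the method names and the sample values (0.278 for the space width, font size 50) are illustrative and not taken from PDFStreamEngine.

```java
import java.awt.geom.AffineTransform;

final class RotationScaleDemo
{
    // Raw x-scale entry of the matrix. For a 180° rotation this is -1,
    // which is the value the issue reports for textMatrix.getXScale().
    static double rawXScale(AffineTransform m)
    {
        return m.getScaleX();
    }

    // Length of the image of the x unit vector (1, 0) under the matrix.
    // It is never negative, whatever the rotation.
    static double xScaleMagnitude(AffineTransform m)
    {
        return Math.hypot(m.getScaleX(), m.getShearY());
    }

    // The product from the issue description, with the two x-scale
    // values passed in so that both variants can be compared.
    static double spaceWidthDisp(double spaceWidthText, double fontSizeText,
                                 double horizontalScalingText,
                                 double textXScale, double ctmXScale)
    {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args)
    {
        AffineTransform identity = new AffineTransform();
        AffineTransform rot180 = AffineTransform.getRotateInstance(Math.toRadians(180));

        // Rotation in the text matrix, identity ctm (page 1 of the test file).
        double raw = spaceWidthDisp(0.278, 50, 1,
                rawXScale(rot180), rawXScale(identity));
        // Same setup, but using the sign-independent magnitude instead.
        double magnitude = spaceWidthDisp(0.278, 50, 1,
                xScaleMagnitude(rot180), xScaleMagnitude(identity));

        System.out.println("raw x-scale product:       " + raw);
        System.out.println("magnitude x-scale product: " + magnitude);
    }
}
```

Swapping the two transforms (rotation in the ctm, identity text matrix) gives the same signs, because the formula multiplies both factors. Whether using a magnitude is the right fix for PDFStreamEngine is not something this sketch can decide; it only shows where the negative value comes from.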


