Markus Seifert created PDFBOX-6020:
--------------------------------------

             Summary: mix of subscript and superscript can lead to unnecessary new lines during text extraction
                 Key: PDFBOX-6020
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6020
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.5 PDFBox
            Reporter: Markus Seifert


I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5. 
This worked out fine so far, but one of our JUnit tests failed with the new 
version. During text extraction with PDFTextStripper, unnecessary line feeds 
were created for a line that contained both subscript and superscript text. 
While debugging the issue I found some changes that were made in the method 
PDFTextStripper.writePage(). I think maxYForLine, maxHeightForLine and 
minYTopForLine, which are used for the overlap check, are reset too often.
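
To illustrate why resetting the line extents too often could matter, here is a 
simplified sketch. It is not the code from PDFTextStripper.writePage(); the 
class, the method names and all coordinates are hypothetical, and the model 
only assumes that a glyph stays on the current line when its vertical range 
overlaps the tracked range of that line.

```java
public class LineOverlapSketch {
    // True if the vertical ranges [top1, bottom1] and [top2, bottom2] intersect.
    public static boolean overlaps(float top1, float bottom1, float top2, float bottom2) {
        return top1 <= bottom2 && top2 <= bottom1;
    }

    // Counts line breaks for glyphs given as {top, bottom} pairs (y grows downwards).
    // resetEachGlyph = true:  the line range is replaced by the latest glyph.
    // resetEachGlyph = false: the line range grows to cover every glyph on the line.
    public static int countLineBreaks(float[][] glyphs, boolean resetEachGlyph) {
        int breaks = 0;
        float lineTop = glyphs[0][0];
        float lineBottom = glyphs[0][1];
        for (int i = 1; i < glyphs.length; i++) {
            float top = glyphs[i][0];
            float bottom = glyphs[i][1];
            boolean sameLine = overlaps(top, bottom, lineTop, lineBottom);
            if (!sameLine) {
                breaks++;
            }
            if (resetEachGlyph || !sameLine) {
                // Start tracking from this glyph only.
                lineTop = top;
                lineBottom = bottom;
            } else {
                // Extend the tracked range of the current line.
                lineTop = Math.min(lineTop, top);
                lineBottom = Math.max(lineBottom, bottom);
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Hypothetical vertical ranges, not measured from the example PDF.
        float[][] glyphs = {
            {100f, 110f}, // baseline text
            {108f, 114f}, // subscript
            {96f, 102f},  // superscript
            {100f, 110f}  // baseline text
        };
        System.out.println("reset every glyph: " + countLineBreaks(glyphs, true));
        System.out.println("accumulated range: " + countLineBreaks(glyphs, false));
    }
}
```

In this toy model the subscript and the superscript each overlap the baseline 
text but not each other, so a range that is reset after every glyph produces a 
break that an accumulated range avoids.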

 

There's a check made with the value of 'Math.abs(position.getX() - 
lastPosition.getTextPosition().getX())'. I think it might have to be 
changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() + 
lastPosition.getTextPosition().getWidth()))' to find relevant gaps, because 
that measures from the end of the previous text position instead of its start.
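
The difference between the two expressions can be shown with plain numbers. 
This is only a sketch: the class and method names are made up for 
illustration, the floats stand in for the getX() and getWidth() values, and 
the coordinates are hypothetical.

```java
public class GapCheck {
    // Distance between the left edges of two consecutive text positions
    // (the form of the check quoted in the report).
    public static float startToStart(float lastX, float currentX) {
        return Math.abs(currentX - lastX);
    }

    // Distance between the right edge of the previous text position and the
    // left edge of the current one (the form suggested in the report).
    public static float edgeGap(float lastX, float lastWidth, float currentX) {
        return Math.abs(currentX - (lastX + lastWidth));
    }

    public static void main(String[] args) {
        // Hypothetical values: the previous position starts at x = 100 and is
        // 12 units wide; the current one starts at x = 112.5.
        float lastX = 100f;
        float lastWidth = 12f;
        float currentX = 112.5f;
        System.out.println("start-to-start: " + startToStart(lastX, currentX));
        System.out.println("edge gap:       " + edgeGap(lastX, lastWidth, currentX));
    }
}
```

With these numbers the start-to-start distance is 12.5 while the actual gap 
between the two positions is only 0.5, so a threshold applied to the first 
value would see a large jump where the text is in fact nearly contiguous.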

 

An example PDF can be downloaded from 
https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf

 

The line of text we had problems with is on page 2: 'gin-Anion 
(12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3- 
beschrieben werden oder'. 



