[ https://issues.apache.org/jira/browse/PDFBOX-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-6020:
------------------------------------
    Description:
I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5. This has worked fine so far, but one of our JUnit tests failed with the new version. During text extraction with PDFTextStripper, unnecessary line feeds were created for a line that contained subscript as well as superscript text. While debugging the issue I found some changes that were made in the method PDFTextStripper.writePage(). I think maxYForLine, maxHeightForLine and minYTopForLine, which are used for the overlap check, are reset too often.

There's a check made with the value of 'Math.abs(position.getX() - lastPosition.getTextPosition().getX())'. But I think it might have to be changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() + lastPosition.getTextPosition().getWidth()))' to find relevant gaps.

An example PDF can be downloaded from [here|https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf]

The text line we had problems with was on page 2: 'gin-Anion (12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3- beschrieben werden oder'.

  was:
I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5. This has worked fine so far, but one of our JUnit tests failed with the new version. During text extraction with PDFTextStripper, unnecessary line feeds were created for a line that contained subscript as well as superscript text. While debugging the issue I found some changes that were made in the method PDFTextStripper.writePage(). I think maxYForLine, maxHeightForLine and minYTopForLine, which are used for the overlap check, are reset too often.

There's a check made with the value of 'Math.abs(position.getX() - lastPosition.getTextPosition().getX())'. But I think it might have to be changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() + lastPosition.getTextPosition().getWidth()))' to find relevant gaps.

An example PDF can be downloaded from [https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf|https://deref-gmx.net/mail/client/1zP7-96Q_X4/dereferrer/?redirectUrl=https%3A%2F%2Fpatentimages.storage.googleapis.com%2F57%2Fb2%2F2f%2F3b5ffe86d83ef5%2FDE102016007628A1.pdf&lm]

The text line we had problems with was on page 2: 'gin-Anion (12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3- beschrieben werden oder'.


> mix of subscript and superscript can lead to unnecessary new lines during text extraction
> -----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6020
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6020
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Markus Seifert
>            Priority: Minor
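
The difference between the two expressions quoted in the report can be illustrated with a small, self-contained sketch. It does not use PDFBox and is not the code from PDFTextStripper.writePage(); the class name GapCheckSketch, the method names, and the coordinates are hypothetical and chosen only for illustration. The first method measures the distance between the left edges of two consecutive glyphs, the second measures the distance between the right edge of the previous glyph and the left edge of the current one, as the reporter proposes.

```java
public class GapCheckSketch {

    /** Distance between the left edges of two consecutive glyphs (the form quoted as the existing check). */
    public static float startToStartDistance(float lastX, float currentX) {
        return Math.abs(currentX - lastX);
    }

    /** Distance between the right edge of the previous glyph and the left edge of the current one (the proposed form). */
    public static float endToStartGap(float lastX, float lastWidth, float currentX) {
        return Math.abs(currentX - (lastX + lastWidth));
    }

    public static void main(String[] args) {
        // Hypothetical values: a glyph at x=100 with width 12, directly followed by a subscript glyph at x=112.
        float lastX = 100f;
        float lastWidth = 12f;
        float currentX = 112f;

        // The start-to-start distance includes the width of the previous glyph.
        System.out.println(startToStartDistance(lastX, currentX));      // prints 12.0

        // The end-to-start gap is zero because the glyphs are adjacent.
        System.out.println(endToStartGap(lastX, lastWidth, currentX));  // prints 0.0
    }
}
```

With these made-up numbers, the first value grows with the width of the previous glyph even though the two glyphs touch, while the second reflects only the actual horizontal gap. Whether this explains the extra line feeds in the example PDF would need to be confirmed against the real PDFTextStripper code.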