[ https://issues.apache.org/jira/browse/PDFBOX-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir updated PDFBOX-5213:
-----------------------------
    Description: 
Since version 2.0.22, PDFTextStripper adds a line break after superscript
(sup) values.

Earlier output:

"Other (12) 1,505 832"

Current output:

"Other (12)
 1,505 832"

You can see this by comparing the attached files
GS-2010-q4-earnings.pdf_expected.html (2.0.21 and earlier) and
GS-2010-q4-earnings.pdf_result.html (2.0.22 and later).

  !image-2021-06-14-14-50-08-236.png!

If I take the latest version of PDFBox (for example 2.0.24), copy in the
PDFTextStripper code from 2.0.21 and use that instead, I don't see this issue.
So the regression appears to be confined to PDFTextStripper.

To reproduce, you can use the following simple code (copied from your
examples). pageBytes holds the contents of the file GS-2010-q4-earnings.pdf.

List<String> pages = new ArrayList<>();

PDDocument pdDocument = null;
try {
    String pass = "";
    PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
    pdDocument = parser.parse();

    // Process at most 'limit' pages.
    int numberOfPages = pdDocument.getNumberOfPages();
    if (limit < numberOfPages) {
        numberOfPages = limit;
    }

    // Extract the text of each page separately.
    for (int i = 0; i < numberOfPages; i++) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(i + 1);
        stripper.setEndPage(i + 1);
        pages.add(stripper.getText(pdDocument));
    }
} catch (Exception e) {
    log.error(e.getMessage(), e);
} finally {
    if (pdDocument != null) {
        try {
            pdDocument.close();
        } catch (IOException e) {
            log.error(e.getMessage(), e);
        }
    }
}
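
The symptom described above could also be checked automatically once the page text has been extracted. The following is only a minimal sketch under that assumption: the class SupLineBreakCheck and the method isOnSingleLine are hypothetical names, not part of PDFBox, and the two sample strings are copied from the before/after output quoted in this report rather than produced by running PDFTextStripper.

```java
import java.util.Arrays;

public class SupLineBreakCheck {

    // Returns true if 'row' appears entirely within one line of 'pageText'.
    public static boolean isOnSingleLine(String pageText, String row) {
        // "\\R" matches any line terminator (\n, \r\n, \r, ...).
        return Arrays.stream(pageText.split("\\R"))
                .anyMatch(line -> line.contains(row));
    }

    public static void main(String[] args) {
        String row = "Other (12) 1,505 832";

        // Shape of the output reported for 2.0.21 and earlier.
        String before = "Other (12) 1,505 832\n";
        // Shape of the output reported for 2.0.22 and later.
        String after = "Other (12)\n 1,505 832\n";

        System.out.println(isOnSingleLine(before, row)); // prints "true"
        System.out.println(isOnSingleLine(after, row));  // prints "false"
    }
}
```

In a real check, the strings would presumably come from the pages list built by the reproduction code, for example isOnSingleLine(pages.get(i), "Other (12) 1,505 832") for the page that contains this row.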

 

 

 


> PDFTextStripper adds next line symbol after sup values (regression) 
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-5213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5213
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23, 2.0.24
>            Reporter: Vladimir
>            Priority: Minor
>             Fix For: 2.0.21
>
>         Attachments: GS-2010-q4-earnings.pdf, 
> GS-2010-q4-earnings.pdf_expected.html, GS-2010-q4-earnings.pdf_result.html, 
> image-2021-06-14-14-50-08-236.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

