[ https://issues.apache.org/jira/browse/PDFBOX-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir updated PDFBOX-5213:
-----------------------------
    Description:
Since version 2.0.22, PDFTextStripper adds a newline character after superscript (sup) values.

Earlier:
"Other (12) 1,505 832"
Now:
"Other (12)
1,505 832"

You can see this by comparing the files GS-2010-q4-earnings.pdf_expected.html (2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and higher).

!image-2021-06-14-14-50-08-236.png!

If I take the latest version of PDFBox, such as 2.0.24, and replace its PDFTextStripper with the code from 2.0.21, I don't see this issue. So the regression is only in PDFTextStripper.

To reproduce, you can use the following simple code (copied from your examples). pageBytes holds the contents of the file GS-2010-q4-earnings.pdf.

    List<String> pages = new ArrayList<>();
    PDDocument pdDocument = null;
    try {
        String pass = "";
        PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
        pdDocument = parser.parse();
        // Cap the number of extracted pages at limit.
        int numberOfPages = pdDocument.getNumberOfPages();
        if (limit < numberOfPages) {
            numberOfPages = limit;
        }
        // Extract each page separately; page numbers are 1-based.
        for (int i = 0; i < numberOfPages; i++) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(i + 1);
            stripper.setEndPage(i + 1);
            pages.add(stripper.getText(pdDocument));
        }
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        if (pdDocument != null) {
            try {
                pdDocument.close();
            } catch (IOException e) {
                log.error(e.getMessage(), e);
            }
        }
    }

> PDFTextStripper adds next line symbol after sup values (regression)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-5213
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5213
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23, 2.0.24
>            Reporter: Vladimir
>            Priority: Minor
>             Fix For: 2.0.21
>
>         Attachments: GS-2010-q4-earnings.pdf, GS-2010-q4-earnings.pdf_expected.html, GS-2010-q4-earnings.pdf_result.html, image-2021-06-14-14-50-08-236.png
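One way to check for the reported symptom without comparing the HTML attachments by eye is to look for lines of the 2.0.21 output that disappear from the newer output but reappear when two neighbouring lines are joined. The sketch below is only an illustration of that idea: SpuriousBreakFinder and findSplitLines are hypothetical names, not part of PDFBox, and it assumes both extractions have already been split into lists of lines and that a broken line is rejoined with a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of PDFBox): reports lines that were split in two. */
public class SpuriousBreakFinder {

    /**
     * Returns every expected line that is absent from the actual output but
     * reappears when two consecutive actual lines are joined with one space.
     */
    public static List<String> findSplitLines(List<String> expected, List<String> actual) {
        Set<String> expectedLines = new HashSet<>();
        for (String line : expected) {
            expectedLines.add(line.trim());
        }
        Set<String> actualLines = new HashSet<>();
        for (String line : actual) {
            actualLines.add(line.trim());
        }

        List<String> split = new ArrayList<>();
        for (int i = 0; i + 1 < actual.size(); i++) {
            // Assumption: the extra line break replaced exactly one space.
            String joined = actual.get(i).trim() + " " + actual.get(i + 1).trim();
            if (expectedLines.contains(joined) && !actualLines.contains(joined)) {
                split.add(joined);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        // Sample lines taken from the description of this issue.
        List<String> expected = Arrays.asList("Other (12) 1,505 832");
        List<String> actual = Arrays.asList("Other (12)", "1,505 832");
        System.out.println(findSplitLines(expected, actual)); // prints [Other (12) 1,505 832]
    }
}
```

In practice the two lists could come from splitting the text returned by stripper.getText(pdDocument) on line breaks, once with each PDFTextStripper version. The check is deliberately strict, so it will miss breaks that also changed the surrounding whitespace.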