[
https://issues.apache.org/jira/browse/PDFBOX-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir updated PDFBOX-5213:
-----------------------------
Description:
Since version 2.0.22, PDFTextStripper inserts a line break after superscript
(sup) values.
Earlier output:
"Other (12) 1,505 832"
Current output:
"Other (12)
1,505 832"
You can see this by comparing the files GS-2010-q4-earnings.pdf_expected.html
(2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and later).
!image-2021-06-14-14-50-08-236.png!
If I take the latest PDFBox version (2.0.24) and replace its PDFTextStripper
with the code from 2.0.21, the issue does not occur. So the regression is
confined to PDFTextStripper.
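The comparison of the two outputs can also be automated. The following is only a minimal sketch and not part of PDFBox: the class name ExtractionDiff and the method firstDifferingLine are hypothetical, and it assumes both extraction results are already available as plain strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class ExtractionDiff {

    /**
     * Returns the zero-based index of the first line at which the two texts
     * differ, or -1 if they are identical line by line.
     */
    static int firstDifferingLine(String expected, String actual) {
        // "\\R" matches any line break; the -1 limit keeps trailing empty lines.
        List<String> a = Arrays.asList(expected.split("\\R", -1));
        List<String> b = Arrays.asList(actual.split("\\R", -1));
        int common = Math.min(a.size(), b.size());
        for (int i = 0; i < common; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        // Same prefix: the texts differ only if one has extra lines.
        return a.size() == b.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // The two strings are the outputs quoted in this report.
        String expected = "Other (12) 1,505 832";
        String actual = "Other (12)\n1,505 832";
        System.out.println(firstDifferingLine(expected, actual));
    }
}
```

With the two strings from this report the first line already differs, because the unwanted line break cuts the row short.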
To reproduce, you can use the following simple code (copied from your examples).
pageBytes holds the contents of the file GS-2010-q4-earnings.pdf; limit caps the
number of pages processed.
    List<String> pages = new ArrayList<>();
    PDDocument pdDocument = null;
    try {
        String pass = "";
        PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
        pdDocument = parser.parse();
        int numberOfPages = pdDocument.getNumberOfPages();
        if (limit < numberOfPages) {
            numberOfPages = limit;
        }
        // Extract the text of each page separately.
        for (int i = 0; i < numberOfPages; i++) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(i + 1);
            stripper.setEndPage(i + 1);
            pages.add(stripper.getText(pdDocument));
        }
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        if (pdDocument != null) {
            try {
                pdDocument.close();
            } catch (IOException e) {
                log.error(e.getMessage(), e);
            }
        }
    }
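Once the pages list is filled, the regression can be detected without inspecting the output by hand. The sketch below is an assumption about how such a check might look, not PDFBox code: RowCheck and containsRowOnOneLine are hypothetical names, and whitespace inside each line is collapsed because the exact spacing of the extracted text is not guaranteed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class RowCheck {

    /** Returns true if some line of some page contains the expected row unbroken. */
    static boolean containsRowOnOneLine(List<String> pages, String expectedRow) {
        for (String page : pages) {
            for (String line : page.split("\\R")) {
                // Collapse runs of whitespace so spacing differences do not matter.
                String normalized = line.trim().replaceAll("\\s+", " ");
                if (normalized.contains(expectedRow)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-ins for the extracted pages, using the outputs quoted in this report.
        List<String> before = Arrays.asList("Other (12) 1,505 832");
        List<String> after = Arrays.asList("Other (12)\n1,505 832");
        System.out.println(containsRowOnOneLine(before, "Other (12) 1,505 832")); // true
        System.out.println(containsRowOnOneLine(after, "Other (12) 1,505 832"));  // false
    }
}
```

In a real test, the pages list produced by the reproduction code would be passed in place of the stand-in lists.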
was:
Since version 2.0.22
PDFTextStripper adds next line symbol after sup values.
Like earlier
"Other (12) 1,505 832"
Now:
"Other (12)
1,505 832"
You can see this by comparing files GS-2010-q4-earnings.pdf_expected.html
(2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and higher)
!image-2021-06-14-14-50-08-236.png!
If I took latest version of PDFbox like 2.0.24 and copy code of PDFTextStripper
from 2.0.21 and use it then I don't see this issue. So it's regression only in
PDFTextStripper.
To reproduce, toy can use next simple code (copied from your examples).
pageBytes is file GS-2010-q4-earnings.pdf
    List<String> pages = new ArrayList<>();
    PDDocument pdDocument = null;
    try {
        String pass = "";
        PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
        pdDocument = parser.parse();
        int numberOfPages = pdDocument.getNumberOfPages();
        if (limit < numberOfPages) {
            numberOfPages = limit;
        }
        for (int i = 0; i < numberOfPages; i++) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(i + 1);
            stripper.setEndPage(i + 1);
            pages.add(stripper.getText(pdDocument));
        }
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        if (pdDocument != null) {
            try {
                pdDocument.close();
            } catch (IOException e) {
                log.error(e.getMessage(), e);
            }
        }
    }
> PDFTextStripper adds next line symbol after sup values (regression)
> --------------------------------------------------------------------
>
> Key: PDFBOX-5213
> URL: https://issues.apache.org/jira/browse/PDFBOX-5213
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.22, 2.0.23, 2.0.24
> Reporter: Vladimir
> Priority: Minor
> Fix For: 2.0.21
>
> Attachments: GS-2010-q4-earnings.pdf,
> GS-2010-q4-earnings.pdf_expected.html, GS-2010-q4-earnings.pdf_result.html,
> image-2021-06-14-14-50-08-236.png