Igor created PDFBOX-4986:
----------------------------
Summary: Text can't be extracted from a document
Key: PDFBOX-4986
URL: https://issues.apache.org/jira/browse/PDFBOX-4986
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.21
Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit
Reporter: Igor
Attachments: c0015_re_1375881383129_eng[1].pdf
Hello everyone,
PDFBox cannot extract the text from the attached document. It extracts only
the first page, which contains "Please wait...". The other pages are missing.
Loading the file in PDFDebugger also shows only the first page. Adobe opens
the document and displays all the text correctly, so I suspect the content is
dynamically generated.
Sample code to reproduce the issue:
{code:java}
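import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Load with an empty password, then extract the text of all pages.
try (PDDocument document = PDDocument.load(
        new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text: " + text);
}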
{code}
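A rough way to test the dynamic-content suspicion without PDFBox, assuming the
file is an XFA form (which nothing in this report confirms), is to scan the raw
bytes for the /XFA name. The class and method names below are hypothetical, and
the check is only a heuristic: it misses a key stored inside a compressed
object stream, and a hit does not prove the form is dynamic.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XfaKeyScan {

    // Heuristic: true if the raw PDF bytes contain the name "/XFA".
    // Assumption: the key is not hidden inside a compressed object stream.
    static boolean containsXfaKey(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data
        // survives the conversion and a plain substring search is safe.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/XFA");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java XfaKeyScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Contains /XFA key: " + containsXfaKey(bytes));
    }
}
```

If the scan finds the key, the missing pages may be defined in the XFA data
rather than in ordinary page content streams, which would be consistent with
only the placeholder page being extracted.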
Thanks.