[jira] [Created] (PDFBOX-4986) Text can't be extracted from a document

Igor (Jira) Sun, 11 Oct 2020 18:57:42 -0700

Igor created PDFBOX-4986:
----------------------------

             Summary: Text can't be extracted from a document
                 Key: PDFBOX-4986
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4986
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.21
         Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit
            Reporter: Igor
         Attachments: c0015_re_1375881383129_eng[1].pdf


Hello everyone,

PDFBox is not able to extract the text from the attached document. It only
extracts the first page, which reads "Please wait...". The text of the other
pages is missing. I've also tried loading the file in PDFDebugger, but it shows
the first page only. Adobe opens the document and displays all of the text
without problems. I suspect the content is generated dynamically.

 

Sample code to reproduce the issue:
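{code:java}
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Load the document with an empty password and extract the text of all pages.
try (PDDocument document = PDDocument.load(new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text: " + text);
}
{code}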
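
One way to test the "dynamically generated content" suspicion is to check whether the file carries an XFA form, since a "Please wait..." placeholder page is typical of dynamic XFA documents. That the attached file is such a document is an assumption; it has not been verified here. The sketch below is a crude, standard-library-only probe that scans the raw bytes for the `/XFA` key. The class `XfaProbe` and its method `containsXfaKey` are hypothetical names and are not part of PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a crude raw-byte probe for an XFA form.
final class XfaProbe {

    // Returns true if the PDF name "/XFA" occurs anywhere in the raw file bytes.
    static boolean containsXfaKey(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data is searchable as text.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/XFA");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XfaProbe <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsXfaKey(bytes)
                ? "Found /XFA: the file probably carries an XFA form"
                : "No uncompressed /XFA key found (inconclusive)");
    }
}
```

A negative result is not conclusive: if the form dictionary is stored in a compressed object stream, the key does not appear in the raw bytes. A positive result would suggest that the visible pages are produced by the viewer from the XFA data rather than stored as ordinary page content.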
 

Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
