[ https://issues.apache.org/jira/browse/PDFBOX-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213495#comment-17213495 ]
Igor commented on PDFBOX-4986:
------------------------------
Thanks for the info! I suspected that it would be something like that. No
worries, I will look into parsing the XFA content in my app.
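
For anyone landing on this thread, here is a rough sketch of what "parsing the
XFA content" could look like. It is not the PDFBox way of doing it, just a
plain JDK DOM pass over the XFA XML that collects the content of every <text>
element. As far as I know, in PDFBox 2.0.x the raw bytes can be obtained with
document.getDocumentCatalog().getAcroForm().getXFA().getBytes(), after checking
getAcroForm() and getXFA() for null. The class name XfaTextSketch, the method
extractTextValues and the sample XML are made up for illustration, and real XFA
packets are considerably more complex than this sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XfaTextSketch {

    // Hypothetical helper: returns the trimmed, non-empty content of every
    // <text> element found in the given XML bytes, in document order.
    static List<String> extractTextValues(byte[] xfaBytes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // The XML comes from an untrusted PDF, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document xml = builder.parse(new ByteArrayInputStream(xfaBytes));

        // "*" matches <text> regardless of which namespace it is declared in.
        NodeList nodes = xml.getElementsByTagNameNS("*", "text");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String value = nodes.item(i).getTextContent().trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample that only mimics the shape of an XFA template packet.
        String sample =
              "<template xmlns=\"http://www.xfa.org/schema/xfa-template/2.8/\">"
            + "<subform name=\"form1\">"
            + "<draw name=\"title\"><value><text>Annual report</text></value></draw>"
            + "<draw name=\"note\"><value><text>Page two content</text></value></draw>"
            + "</subform>"
            + "</template>";
        List<String> values = extractTextValues(sample.getBytes(StandardCharsets.UTF_8));
        for (String value : values) {
            System.out.println(value);
        }
    }
}
```

This only picks up plain <text> values. Rich text, data bound at render time
and anything produced by scripts would need separate handling.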
> Text can't be extracted from a document
> ---------------------------------------
>
> Key: PDFBOX-4986
> URL: https://issues.apache.org/jira/browse/PDFBOX-4986
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.21
> Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit
> Reporter: Igor
> Priority: Major
> Attachments: c0015_re_1375881383129_eng[1].pdf, screenshot-1.png
>
>
> Hello everyone,
>
> PDFBox cannot extract text from the attached document. It only extracts
> the first page, which contains "Please wait...". The other pages are missing.
> I've also tried loading it in PDFDebugger, but it shows the first page only.
> The document opens without problems in Adobe and all the text is visible. I
> suspect it's some kind of dynamically generated content.
>
> Sample code to reproduce the issue:
> {code:java}
> try (PDDocument document = PDDocument.load(
>         new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String text = stripper.getText(document);
>     System.out.println("Text: " + text);
> }
> {code}
>
> Thanks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)