[jira] [Commented] (PDFBOX-4986) Text can't be extracted from a document

Maruan Sahyoun (Jira) Tue, 13 Oct 2020 11:43:21 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213310#comment-17213310
 ]


Maruan Sahyoun commented on PDFBOX-4986:
----------------------------------------

To add to this: the document is constructed at run time from the {{template}} 
in the XFA and the {{dataset}} by a reader application or rendering engine 
that supports XFA. You can get the {{XFA}} using 
{{PDAcroForm.getXFA().getDocument()}}, which gives you an XML Document that 
you can parse. However, this is only the base definition of the XFA, not what 
you see at run time.
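
As a rough sketch of what parsing that XML Document could look like, the 
snippet below lists the top-level packets of an XFA document using only the 
JDK's DOM API. The class name {{XfaPackets}}, its helper methods and the tiny 
{{SAMPLE}} string are made up for illustration; in a real program the 
Document would come from {{PDAcroForm.getXFA().getDocument()}} rather than 
from a string, and a real XFA is much larger.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XfaPackets {

    // Hand-written stand-in for an XFA document (assumption, not taken from
    // the attached PDF). It only mimics the outer packet structure.
    static final String SAMPLE =
        "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
      + "<template xmlns=\"http://www.xfa.org/schema/xfa-template/3.0/\"/>"
      + "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\"/>"
      + "</xdp:xdp>";

    // Parses an XML string into a namespace-aware DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the names of the element children of the root element,
    // i.e. the packets such as the template and the data.
    static List<String> packetNames(Document xfa) {
        List<String> names = new ArrayList<>();
        for (Node n = xfa.getDocumentElement().getFirstChild();
             n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                // getLocalName() is null if the parser was not namespace aware.
                String local = n.getLocalName();
                names.add(local != null ? local : n.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [template, datasets]
        System.out.println(packetNames(parse(SAMPLE)));
    }
}
```

This only inspects the static definition. It does not merge the template 
with the data, so it will not reproduce the pages a viewer shows.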

> Text can't be extracted from a document
> ---------------------------------------
>
>                 Key: PDFBOX-4986
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4986
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.21
>         Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit
>            Reporter: Igor
>            Priority: Major
>         Attachments: c0015_re_1375881383129_eng[1].pdf, screenshot-1.png
>
>
> Hello everyone,
>  
> PDFBox is not able to extract text from the attached document. It can only 
> extract the first page with "Please wait...". The other pages are missing. 
> I've also tried loading it in PDFDebugger, but it shows the first page only. 
> I can open the document fine in Adobe and see all the text fine. I suspect 
> it's some kind of dynamically generated content.
>  
> Sample code to reproduce the issue:
> {code:java}
> try (PDDocument document = PDDocument.load(
>         new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
>       PDFTextStripper stripper = new PDFTextStripper();
>       String text = stripper.getText(document);
>       System.out.println("Text: " + text);
> }
> {code}
>  
> Thanks.



