[jira] [Commented] (PDFBOX-5688) PDFTextStripper not returning full text of document

Michael Klink (Jira) Fri, 29 Sep 2023 09:59:09 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770514#comment-17770514
 ]


Michael Klink commented on PDFBOX-5688:
---------------------------------------

The PDF does not contain the information required to extract text.

The document's font contains glyph drawing instructions for a number of 
character codes, which is why a viewer can render readable text. What is 
missing, though, is a mapping from those character codes or glyph drawing 
instructions to characters in a known encoding like Unicode. PDFBox therefore 
has to guess such a mapping, and in this case the guess goes wrong.
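
To illustrate why a missing mapping produces garbage, here is a minimal toy 
sketch in plain Java. It does not use PDFBox and is not how PDFBox works 
internally. The class name ToUnicodeSketch, the decode helper, and the sample 
table are all hypothetical; the fallback (treating a character code as a 
Unicode code point) is just one conceivable guess, chosen for illustration.

```java
import java.util.Map;

// Toy model: a "font" whose glyphs are addressed by arbitrary character codes.
final class ToUnicodeSketch {

    // Decodes character codes to text. The toUnicode table stands in for the
    // code-to-Unicode mapping a PDF font may carry; pass null to model a font
    // that lacks it.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = (toUnicode == null) ? null : toUnicode.get(code);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                // Assumed fallback for illustration only: interpret the raw
                // code as a Unicode code point.
                sb.appendCodePoint(code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The glyphs for codes 1..4 draw "H", "e", "l", "o" on screen.
        int[] codes = {1, 2, 3, 3, 4};
        Map<Integer, String> table = Map.of(1, "H", 2, "e", 3, "l", 4, "o");

        System.out.println(decode(codes, table)); // prints "Hello"

        // Without the table, the same codes decode to control characters.
        String guessed = decode(codes, null);
        guessed.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

With the table present the codes decode to readable text; without it, the 
same codes come out as control characters even though the glyphs still render 
correctly on screen.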

(By the way, a regular copy and paste in Adobe Acrobat also produces garbage. 
The document was created without the intent to allow text extraction, maybe 
even with the intent to make text extraction difficult.)

> PDFTextStripper not returning full text of document
> ---------------------------------------------------
>
>                 Key: PDFBOX-5688
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5688
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.29
>            Reporter: Joseph Jezerinac
>            Priority: Major
>         Attachments: pdfbox-get-text-problem.pdf
>
>
>  
> {code:java}
> try (PDDocument document = PDDocument.load(pdfFile)) {
>   String text = new PDFTextStripper().getText(document);
> } {code}
> > The getText call above returns only a few characters (\n, \r, 0, 1), but 
> > when the PDF is opened in Chrome or Acrobat it appears to have lots of text.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
