[ https://issues.apache.org/jira/browse/PDFBOX-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032533#comment-17032533 ]
Tilman Hausherr commented on PDFBOX-4768:
-----------------------------------------
Your PDF is encrypted to prevent text extraction. The user password is empty so
it can be viewed, but the rights prevent text extraction. Try it with Adobe
Reader and you'll get the same effect.
The PDF is also severely broken; try displaying page 38 and later with any viewer.
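
To illustrate the permission check described above: an encrypted PDF stores its user access rights as a 32-bit integer (the /P entry of the encryption dictionary), and in the PDF specification bit 5 (counting from 1 at the low-order end) governs copying or extracting text and graphics. The following is only a minimal sketch using plain Java; the class name, method names, and the two /P values are hypothetical and are not taken from the attached file. In PDFBox itself the equivalent check is exposed through AccessPermission.canExtractContent().

```java
// Minimal sketch: decode the "extract content" right from a /P permission value.
// PermissionBits and its methods are illustrative names, not PDFBox API.
final class PermissionBits {

    // Bit positions are 1-based; bit 1 is the low-order bit.
    static final int EXTRACT_BIT = 5;

    // Returns true if the given 1-based bit is set in the permission value.
    static boolean isBitSet(int permissions, int bit) {
        return (permissions & (1 << (bit - 1))) != 0;
    }

    // Returns true if the permission value allows text and graphics extraction.
    static boolean canExtractContent(int permissions) {
        return isBitSet(permissions, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        int allRights = -4;        // hypothetical /P value with every right granted
        int noExtraction = -20;    // same value with bit 5 cleared

        System.out.println("all rights:    " + canExtractContent(allRights));
        System.out.println("no extraction: " + canExtractContent(noExtraction));
    }
}
```

A tool that honours these rights, as ExtractText does, refuses to extract when this bit is cleared, even though the empty user password still lets the document be opened and displayed.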
> Unable to extract text from PDF
> -------------------------------
>
> Key: PDFBOX-4768
> URL: https://issues.apache.org/jira/browse/PDFBOX-4768
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.18
> Reporter: Jan Vlug
> Priority: Major
> Attachments: kst-31430-3-b3_unextractable.pdf
>
>
> I have a PDF document (see attachment) that can be viewed in Evince, but Tika
> text extraction does not work. I think that this is due to a crash in PDFBox.
> I'm also a bit puzzled by the message: "You do not have permission to extract
> text".
> Here is the output of the ExtractText command:
> {{java -jar pdfbox-app-2.0.19-20200206.060243-86.jar ExtractText
> kst-31430-3-b3_unextractable.pdf tekst_jan.txt}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser
> validateStreamLength}}
> {{WARNING: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream, stream start position: 211564, length: 3336,
> expected end position: 214900}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser
> parseCOSStream}}
> {{WARNING: stream ends with 'endobj' instead of 'endstream' at offset 225134}}
> {{Exception in thread "main" java.io.IOException: You do not have permission
> to extract text}}
> {{ at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:223)}}
> {{ at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)}}
> {{ at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)}}