[ https://issues.apache.org/jira/browse/PDFBOX-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032556#comment-17032556 ]
Tim Allison commented on PDFBOX-4768:
-------------------------------------

To complement Tilman's points...qpdf complains about this file:
{noformat}
WARNING: kst-31430-3-b3_unextractable.pdf: file is damaged
WARNING: kst-31430-3-b3_unextractable.pdf (offset 638658): xref not found
WARNING: kst-31430-3-b3_unextractable.pdf: Attempting to reconstruct cross-reference table
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 214900): expected endstream
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): attempting to recover stream length
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): recovered stream length: 13564
qpdf: operation succeeded with warnings; resulting file may have some problems
{noformat}

Tika's exception is:
{noformat}
Caused by: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 at offset 8689
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:966)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:636)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:513)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
{noformat}

> Unable to extract text from PDF
> -------------------------------
>
>                 Key: PDFBOX-4768
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4768
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18
>            Reporter: Jan Vlug
>            Priority: Major
>         Attachments: kst-31430-3-b3_unextractable.pdf
>
>
> I have a PDF document (see attachment) that can be viewed in Evince, but tika
> text extraction does not work. I think that this is due to a crash in pdfbox.
> I'm also a bit puzzled by the message: "You do not have permission to extract
> text".
> Here the output of the ExtractText command:
> {{java -jar pdfbox-app-2.0.19-20200206.060243-86.jar ExtractText
> kst-31430-3-b3_unextractable.pdf tekst_jan.txt}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser
> validateStreamLength}}
> {{WARNING: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream, stream start position: 211564, length: 3336,
> expected end position: 214900}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser
> parseCOSStream}}
> {{WARNING: stream ends with 'endobj' instead of 'endstream' at offset 225134}}
> {{Exception in thread "main" java.io.IOException: You do not have permission
> to extract text}}
> {{ at
> org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:223)}}
> {{ at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)}}
> {{ at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
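
A note on the "You do not have permission to extract text" message quoted above: as far as I know, ExtractText raises it when the document's current access permissions deny content extraction, which is a separate check from the parser failure in the Tika stack trace. In the PDF standard security handler those permissions live in the /P integer of the encryption dictionary, where bit 5 (counting from 1) governs copying or extracting text. The sketch below decodes such a value using only the JDK. PermissionFlags and its methods are hypothetical names for illustration and are not part of PDFBox. The log does not say what /P value this damaged file carries, or whether it survived recovery intact, so none is assumed here.

```java
// Hypothetical helper, not a PDFBox class: decodes the /P permission flags
// defined for the PDF standard security handler.
final class PermissionFlags {
    // Bit positions are 1-based, matching the numbering in the PDF specification.
    static final int PRINT_BIT = 3;
    static final int MODIFY_BIT = 4;
    static final int EXTRACT_BIT = 5;
    static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10;

    private PermissionFlags() {
    }

    // Returns true if the given 1-based bit is set in the /P value.
    static boolean isSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    // Text extraction is allowed only when bit 5 is set.
    static boolean canExtractContent(int p) {
        return isSet(p, EXTRACT_BIT);
    }

    static boolean canPrint(int p) {
        return isSet(p, PRINT_BIT);
    }

    static boolean canModify(int p) {
        return isSet(p, MODIFY_BIT);
    }

    static boolean canExtractForAccessibility(int p) {
        return isSet(p, EXTRACT_FOR_ACCESSIBILITY_BIT);
    }

    public static void main(String[] args) {
        // Pass a /P value as the first argument; -1 (all bits set) is the default.
        int p = args.length > 0 ? Integer.parseInt(args[0]) : -1;
        System.out.println("canPrint=" + canPrint(p));
        System.out.println("canModify=" + canModify(p));
        System.out.println("canExtractContent=" + canExtractContent(p));
        System.out.println("canExtractForAccessibility=" + canExtractForAccessibility(p));
    }
}
```

With PDFBox itself, the equivalent question would normally be asked through the document's AccessPermission object rather than by decoding /P by hand. The sketch only shows what that flag means.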