[ https://issues.apache.org/jira/browse/TIKA-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601113#comment-17601113 ]
Tilman Hausherr commented on TIKA-3847:
---------------------------------------
Oh, now that I've slept, I realize it can be null if /Encoding is null.
> NullPointerException when processing pdf document(Allow proceed on
> RuntimeException)
> ------------------------------------------------------------------------------------
>
> Key: TIKA-3847
> URL: https://issues.apache.org/jira/browse/TIKA-3847
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Yurii
> Priority: Major
> Attachments: TIKA-3847.patch
>
>
> I have a PDF document with some corrupted pages (they throw errors even in
> PDF readers such as Adobe Acrobat). However, only a few of its 370 pages
> fail.
> The problem is that the first corrupted page is the 15th, and processing of
> the whole document fails after that.
> A NullPointerException is thrown while getting the fonts of the corrupted
> page.
> What I propose is to allow (by config, default false) handling any runtime
> exception the way we do for IntermediaryIOExceptions.
> Unfortunately, due to an NDA I can't share the document for debugging
> purposes, but the stack trace is below:
> {code:java}
> java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
>     at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
>     at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
>     at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
>     at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
> {code}
> I will attach a proposed patch soon.
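
To illustrate the behaviour proposed in the quoted description (skip a failing page instead of aborting the whole document, opt-in, off by default), here is a minimal sketch. It is not the attached patch and does not use Tika or PDFBox classes: PageSkipSketch, processPages, and the catchRuntimeExceptions flag are hypothetical names chosen for the example, and the page processor is a stand-in for whatever actually extracts text from a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical sketch only; none of these names exist in Tika or PDFBox. */
final class PageSkipSketch {

    /**
     * Processes pages 1..pageCount with the given per-page processor.
     * If catchRuntimeExceptions is true, a RuntimeException thrown for one
     * page is recorded and the loop moves on to the next page; if it is
     * false (the assumed default), the exception propagates to the caller.
     * Returns the 1-based numbers of the pages that failed.
     */
    static List<Integer> processPages(int pageCount, IntConsumer pageProcessor,
                                      boolean catchRuntimeExceptions) {
        List<Integer> failedPages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pageProcessor.accept(page);
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    throw e; // default: abort processing of the whole document
                }
                failedPages.add(page); // remember the failure and keep going
            }
        }
        return failedPages;
    }

    public static void main(String[] args) {
        // Simulate a 370-page document whose 15th page throws an NPE.
        IntConsumer processor = page -> {
            if (page == 15) {
                throw new NullPointerException("simulated corrupted page");
            }
        };
        List<Integer> failed = processPages(370, processor, true);
        System.out.println("Failed pages: " + failed);
    }
}
```

With the flag set to false the same call would rethrow the NullPointerException at page 15, which matches the failure described in the report. A real implementation would presumably also want to report the skipped pages to the caller in some way rather than silently dropping them.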