[ 
https://issues.apache.org/jira/browse/TIKA-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601113#comment-17601113
 ] 

Tilman Hausherr commented on TIKA-3847:
---------------------------------------

Oh, now that I've slept, I realize it can be null if /Encoding is null.

> NullPointerException when processing pdf document(Allow proceed on 
> RuntimeException)
> ------------------------------------------------------------------------------------
>
>                 Key: TIKA-3847
>                 URL: https://issues.apache.org/jira/browse/TIKA-3847
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Yurii
>            Priority: Major
>         Attachments: TIKA-3847.patch
>
>
> I have a PDF document with some corrupted pages (they throw errors even in 
> PDF readers like Adobe Acrobat). However, only a few of the 370 pages fail.
> The issue is that the first corrupted page is the 15th, and processing of the 
> whole document fails after that.
> A NullPointerException is thrown when getting the fonts of the corrupted 
> page.
> What I propose is to allow (by config, default false) handling any runtime 
> exception the way we do for IntermediaryIOExceptions.
> Unfortunately, due to an NDA I can't share the document for debugging 
> purposes, but the stack trace is below:
> {code:java}
> java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
>     at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
>     at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
>     at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
>     at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) {code}
> I will attach the proposed patch soon.
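
For illustration only, the proposal in the description (tolerate runtime exceptions per page, off by default) could be sketched roughly as below. This is a self-contained toy and not Tika's or PDFBox's actual code: the class TolerantPageLoop, the Result holder, and the flag name catchRuntimeExceptions are all hypothetical, and the page handler merely simulates the font failure on page 15.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical sketch; names are assumptions, not Tika or PDFBox API.
class TolerantPageLoop {

    /** Pages that were processed, plus the exceptions that were swallowed. */
    static final class Result {
        final List<Integer> processedPages = new ArrayList<>();
        final List<RuntimeException> swallowed = new ArrayList<>();
    }

    static Result processPages(int pageCount, IntConsumer pageHandler,
                               boolean catchRuntimeExceptions) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pageHandler.accept(page);
                result.processedPages.add(page);
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    // Default (false): the first bad page aborts the whole document.
                    throw e;
                }
                // Opt-in: record the failure and continue with the next page.
                result.swallowed.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated handler: page 15 fails the way the corrupted page does.
        IntConsumer handler = page -> {
            if (page == 15) {
                throw new NullPointerException("simulated font failure on page " + page);
            }
        };
        Result r = processPages(370, handler, true);
        System.out.println(r.processedPages.size() + " pages processed, "
                + r.swallowed.size() + " skipped");
    }
}
```

With the flag left at false, the same call rethrows on page 15, which matches the behavior reported above.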



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
