[ https://issues.apache.org/jira/browse/TIKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648535#comment-15648535 ]
Tim Allison commented on TIKA-2159:
-----------------------------------
In addition to corrupt images, make sure to include cases where required
dependencies aren't on the classpath:
{noformat}
Caused by: org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
    at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
    at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:72)
    at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:56)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
    at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
    at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70)
    at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:409)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:171)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:152)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
{noformat}
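To make the idea concrete, here is a rough, self-contained sketch of what uniform handling could look like. It is not Tika code: EmbeddedHandlingSketch, StreamSource, StreamStep, handleEmbedded and the two metadata keys are all hypothetical names invented for illustration, and a plain Map stands in for the embedded document's Metadata. The point is only that getting the stream, detection and parsing share one handler, so a failure before the parser (such as the missing image reader above) is recorded the same way as a parse failure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Tika APIs.
final class EmbeddedHandlingSketch {

    // Hypothetical metadata keys, chosen for this example only.
    static final String EXCEPTION_KEY = "X-Sketch:embedded_exception";
    static final String STAGE_KEY = "X-Sketch:embedded_exception_stage";

    /** Supplies the embedded stream; may fail before any parser runs. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Stand-in for one processing step (detection or parsing). */
    interface StreamStep {
        void apply(InputStream in) throws Exception;
    }

    /**
     * Runs all three stages under a single handler. On failure, the stage
     * name and the stack trace are stored in the returned map instead of
     * being swallowed or propagated.
     */
    static Map<String, String> handleEmbedded(StreamSource source,
                                              StreamStep detect,
                                              StreamStep parse) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String stage = "get-stream";
        try (InputStream in = source.open()) {
            stage = "detect";
            detect.apply(in);
            stage = "parse";
            parse.apply(in);
        } catch (Exception | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, i.e. a dependency
            // that is missing from the classpath at runtime.
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace, true));
            metadata.put(STAGE_KEY, stage);
            metadata.put(EXCEPTION_KEY, trace.toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Simulates a pre-parse failure while obtaining the stream.
        Map<String, String> failed = handleEmbedded(
                () -> { throw new IOException("image reader not installed"); },
                in -> { },
                in -> { });
        System.out.println("failed stage: " + failed.get(STAGE_KEY));

        // A healthy embedded stream leaves the map empty.
        Map<String, String> ok = handleEmbedded(
                () -> new ByteArrayInputStream(new byte[] {1, 2, 3}),
                in -> { },
                in -> { });
        System.out.println("ok recorded nothing: " + ok.isEmpty());
    }
}
```

Recording the stage alongside the trace is a design choice of this sketch, not something the issue prescribes; it just makes it easy for tests to assert that corrupt-image and missing-dependency cases fail where expected.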
> Handle pre-parse embedded object exceptions uniformly and more robustly
> -----------------------------------------------------------------------
>
> Key: TIKA-2159
> URL: https://issues.apache.org/jira/browse/TIKA-2159
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Tim Allison
> Priority: Minor
>
> When an embedded document is parsed and causes an exception, we're currently
> catching that and swallowing it in ParsingEmbeddedDocumentExtractor (the
> default) or reporting it in the RecursiveParserWrapper by storing the
> stacktrace in the Metadata of the embedded document.
> However, if there's an exception during detection on the embedded stream or
> on getting the stream _before_ the stream hits the parser, we aren't handling
> that uniformly or robustly across parsers.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)