[jira] [Commented] (TIKA-2159) Handle pre-parse embedded object exceptions uniformly and more robustly

Tim Allison (JIRA) Tue, 08 Nov 2016 11:26:08 -0800

    [ https://issues.apache.org/jira/browse/TIKA-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648535#comment-15648535 ]


Tim Allison commented on TIKA-2159:
-----------------------------------

In addition to corrupt images, make sure to include cases where required 
dependencies aren't on the classpath:

{noformat}
Caused by: org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
        at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
        at org.apache.pdfbox.filter.JPXFilter.readJPX(JPXFilter.java:72)
        at org.apache.pdfbox.filter.JPXFilter.decode(JPXFilter.java:56)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235)
        at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147)
        at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70)
        at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:409)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:171)
        at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:152)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:393)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
{noformat}
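
To make the idea concrete, here is a rough sketch of what uniform handling 
could look like. It is only an illustration, not Tika or PDFBox code: the 
class name PreParseGuard and its methods are hypothetical. It wraps the step 
that obtains an embedded object (the step that failed in the trace above), 
records the failure instead of letting it abort the parent document, and 
treats corrupt data and missing dependencies the same way. A missing jar can 
also surface as a NoClassDefFoundError, which is an Error rather than an 
Exception, so the sketch catches LinkageError as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class PreParseGuard {

    // Failures seen while fetching embedded objects, before any parser runs.
    private final List<String> warnings = new ArrayList<>();

    // Runs the step that obtains an embedded object.
    // Returns the object on success; on failure records the problem and returns null
    // so the caller can skip this embedded object and continue with the parent.
    <T> T fetch(String label, Callable<T> step) {
        try {
            return step.call();
        } catch (Exception | LinkageError e) {
            // Covers corrupt data (e.g. an IOException) and missing
            // dependencies (e.g. NoClassDefFoundError) in the same way.
            warnings.add(label + ": " + e.getClass().getName() + ": " + e.getMessage());
            return null;
        }
    }

    // The recorded failures, in the order they occurred.
    List<String> getWarnings() {
        return warnings;
    }
}
```

A caller would wrap each embedded-object lookup in fetch(...), skip the 
object when null comes back, and afterwards copy getWarnings() to wherever 
the embedded-parse exceptions are reported today.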

> Handle pre-parse embedded object exceptions uniformly and more robustly
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2159
>                 URL: https://issues.apache.org/jira/browse/TIKA-2159
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Tim Allison
>            Priority: Minor
>
> When an embedded document is parsed and causes an exception, we're currently 
> catching that and swallowing it in ParsingEmbeddedDocumentExtractor (the 
> default) or reporting it in the RecursiveParserWrapper by storing the 
> stacktrace in the Metadata of the embedded document.
> However, if there's an exception during detection on the embedded stream or 
> on getting the stream _before_ the stream hits the parser, we aren't handling 
> that uniformly or robustly across parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
