Konstantin Avdeev created CONNECTORS-1307:
---------------------------------------------

             Summary: Tika extractor infinite loop on error
                 Key: CONNECTORS-1307
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
             Project: ManifoldCF
          Issue Type: Bug
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.4
         Environment: Windows 64-bit, java version "1.8.0_77", pdfbox-1.8.10.jar, tika-parsers-1.10.jar
            Reporter: Konstantin Avdeev


The Tika extractor gets stuck, repeatedly trying to parse the same document, 
when it hits the following error:
{code}
FATAL 2016-04-29 10:55:45,505 (Worker thread '41') - Error tossed: null
java.lang.StackOverflowError
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:296)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:348)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
        at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
{code}

-Xss is at its default value, which I believe is 512k.
We could increase the thread stack size, but I think this error should not 
leave the extractor stuck in such a loop.
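
As an illustration only, the handling suggested above could look something 
like the following sketch. It does not use any ManifoldCF or Tika classes: 
GuardedExtraction, Outcome and recurseDeeply are hypothetical names, and the 
recursive method is just a stand-in for the nested extractImages calls in the 
trace. The assumption is that the caller would skip a REJECTED document 
instead of queueing it again.

```java
final class GuardedExtraction {

    /** Hypothetical result of one extraction attempt; not a ManifoldCF type. */
    enum Outcome { PROCESSED, REJECTED }

    /**
     * Runs the parse step and maps a StackOverflowError to REJECTED,
     * so the caller can skip the document instead of retrying it.
     */
    static Outcome extract(Runnable parseStep) {
        try {
            parseStep.run();
            return Outcome.PROCESSED;
        } catch (StackOverflowError e) {
            // The stack has unwound by the time control reaches this handler,
            // so the worker thread can carry on with the next document.
            return Outcome.REJECTED;
        }
    }

    /** Stand-in for a parser that recurses too deeply on a bad document. */
    static int recurseDeeply(int depth) {
        return recurseDeeply(depth + 1) + 1;
    }

    public static void main(String[] args) {
        // A document whose parse overflows the stack is rejected once.
        System.out.println(extract(() -> recurseDeeply(0)));
        // A document that parses normally is reported as processed.
        System.out.println(extract(() -> { }));
    }
}
```

Whether rejecting the document is the right policy for the connector is a 
separate question; the sketch only shows that the error can be caught and 
turned into a per-document result.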
Thanks a lot!




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
