Konstantin Avdeev created CONNECTORS-1307:
---------------------------------------------
Summary: Tika extractor infinite loop on error
Key: CONNECTORS-1307
URL: https://issues.apache.org/jira/browse/CONNECTORS-1307
Project: ManifoldCF
Issue Type: Bug
Components: Tika extractor
Affects Versions: ManifoldCF 2.4
Environment: Windows 64-bit, Java version "1.8.0_77", pdfbox-1.8.10.jar, tika-parsers-1.10.jar
Reporter: Konstantin Avdeev
The Tika extractor gets stuck (it keeps trying to parse the same document
again and again) after the following error:
{code}
FATAL 2016-04-29 10:55:45,505 (Worker thread '41') - Error tossed: null
java.lang.StackOverflowError
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:296)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:348)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
{code}
-Xss is left at its default, which is, I believe, 512k.
We could increase the thread stack size, but I think this error should not
lead to such a situation.
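
To illustrate the behaviour I would expect, here is a rough sketch using only plain JDK calls. It is not ManifoldCF or Tika code, and the names GuardedParse, runGuarded, parseNested and Outcome are hypothetical. It assumes the parse can be run on a dedicated thread (the stackSize argument of the Thread constructor is only a hint to the JVM) and that a StackOverflowError should mark the document as rejected instead of causing a retry.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these names exist in ManifoldCF or Tika.
class GuardedParse {

    /** Outcome of a single parse attempt. */
    enum Outcome { OK, REJECTED }

    /** Stand-in for a parser that recurses once per nesting level. */
    static int parseNested(int depth) {
        return depth == 0 ? 0 : 1 + parseNested(depth - 1);
    }

    /**
     * Runs the task on its own thread with the requested stack size (a hint
     * only) and reports REJECTED on StackOverflowError, so that the caller
     * can skip the document rather than queue it again.
     */
    static Outcome runGuarded(Runnable task, long stackBytes) {
        AtomicReference<Outcome> result = new AtomicReference<>(Outcome.REJECTED);
        Thread worker = new Thread(null, () -> {
            try {
                task.run();
                result.set(Outcome.OK);
            } catch (StackOverflowError e) {
                // StackOverflowError is an Error, so catch (Exception e) would miss it.
                result.set(Outcome.REJECTED);
            }
        }, "guarded-parse", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the attempt as failed.
            Thread.currentThread().interrupt();
            return Outcome.REJECTED;
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Shallow nesting fits in the stack; very deep nesting overflows it.
        System.out.println(runGuarded(() -> parseNested(100), 512 * 1024));
        System.out.println(runGuarded(() -> parseNested(50_000_000), 512 * 1024));
    }
}
```

The point of the sketch is only that the error is caught once at the top of the worker and turned into a final result for that document. Whether the real fix should reject the document, log it, or also cap the recursion depth is something I cannot judge from the outside.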
Thanks a lot!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)