[ https://issues.apache.org/jira/browse/TIKA-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568239#comment-16568239 ]

Tim Allison edited comment on TIKA-2703 at 8/3/18 2:19 PM:
-----------------------------------------------------------

:P
Thank you for sharing the file with me.  Bottom line: there's a bug in Tika.

Your file contains a chart (chart3.xml) that is 216 MB uncompressed.  When I 
extracted text via streaming, I got a ZipBombException after 4 GB of data had 
been written.

Sheet7.xml has 70 shapes. Each shape is a member of the same parent drawing 
that includes chart3.xml.  When I wrote the code, I thought that the drawing 
and its chart data belonged to the shape.  This is wrong.  The shape belongs to 
the drawing.

So, as we iterated through the 70 shapes and processed the chart data each 
time, we wound up processing your 216 MB XML file 70 times (roughly 15 GB of 
XML in total).

The fix is easy: make sure to process the charts of a shape's parent drawing 
only once, no matter how many shapes belong to that drawing.
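
As a rough illustration of that idea (a sketch only, not the actual Tika 
patch), the snippet below contrasts the per-shape loop with a per-drawing 
loop.  The Shape and Drawing classes and both method names are hypothetical 
stand-ins invented for this example; they are not POI or Tika types.  The 
sketch assumes that shapes sharing a drawing hold a reference to the same 
drawing object, so an identity-based set is enough to remember which 
drawings were already handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class DrawingChartDedup {

    /** Hypothetical stand-in for a drawing part that owns chart parts. */
    static final class Drawing {
        final List<String> chartNames;
        Drawing(List<String> chartNames) { this.chartNames = chartNames; }
    }

    /** Hypothetical stand-in for a shape; a shape belongs to its parent drawing. */
    static final class Shape {
        final Drawing parent;
        Shape(Drawing parent) { this.parent = parent; }
    }

    /** Buggy pattern: the parent drawing's charts are handled once per shape. */
    static List<String> processPerShape(List<Shape> shapes) {
        List<String> processed = new ArrayList<>();
        for (Shape shape : shapes) {
            processed.addAll(shape.parent.chartNames);
        }
        return processed;
    }

    /** Fixed pattern: track drawings already seen and handle each one's charts once. */
    static List<String> processPerDrawing(List<Shape> shapes) {
        Set<Drawing> seen =
                Collections.newSetFromMap(new IdentityHashMap<Drawing, Boolean>());
        List<String> processed = new ArrayList<>();
        for (Shape shape : shapes) {
            // add() returns false if this drawing was already recorded
            if (seen.add(shape.parent)) {
                processed.addAll(shape.parent.chartNames);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // 70 shapes that all share one drawing containing chart3.xml
        Drawing drawing = new Drawing(Collections.singletonList("chart3.xml"));
        List<Shape> shapes = new ArrayList<>();
        for (int i = 0; i < 70; i++) {
            shapes.add(new Shape(drawing));
        }
        System.out.println("per shape:   " + processPerShape(shapes).size());   // 70
        System.out.println("per drawing: " + processPerDrawing(shapes).size()); // 1
    }
}
```

With 70 shapes sharing one drawing, the first loop touches chart3.xml 70 
times and the second touches it once.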


> Error indexing a xlsx file
> --------------------------
>
>                 Key: TIKA-2703
>                 URL: https://issues.apache.org/jira/browse/TIKA-2703
>             Project: Tika
>          Issue Type: Bug
>         Environment: Tika 1.17 on solr 7.3 
>            Reporter: Mario Bisonti
>            Priority: Major
>
> Hello.
> When indexing a 38 MB xlsx file,
>  
> I get the following error:
> Error from server at http://localhost:8983/solr/core_share: Expected mime type application/xml but got text/html.
> <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> <title>Error 500 Server Error</title> </head>
> <body><h2>HTTP ERROR 500</h2>
> <p>Problem accessing /solr/core_share/update/extract. Reason:
> <pre> Server Error</pre></p>
> <h3>Caused by:</h3>
> <pre>java.lang.OutOfMemoryError
>     at java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:188)
>     at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:180)
>     at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:147)
>     at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:660)
>     at java.base/java.lang.StringBuilder.append(StringBuilder.java:195)
>     at org.apache.solr.handler.extraction.SolrContentHandler.characters(SolrContentHandler.java:302)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>     at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
>     at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:306)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLTikaBodyPartHandler.run(OOXMLTikaBodyPartHandler.java:147)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.handleEndOfRun(OOXMLWordAndPowerPointTextHandler.java:468)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLWordAndPowerPointTextHandler.endElement(OOXMLWordAndPowerPointTextHandler.java:450)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
>     at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1714)
>     at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2879)
>     at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
>     at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
>     at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:532)
>     at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
>     at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
>     at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
>     at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
>     at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
>     at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:324)
>     at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:197)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleGeneralTextContainingPart(AbstractOOXMLExtractor.java:506)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processShapes(XSSFExcelExtractorDecorator.java:279)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:185)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:135)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:143)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
>     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
>     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:517)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:530)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
>     at java.base/java.lang.Thread.run(Thread.java:844)
> </pre> </body> </html>
>  
>  
> How could I solve it?
>  
> Thanks a lot
> Mario



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
