Don created TIKA-3017:
-------------------------
Summary: OOM in XSLFSheet.java
Key: TIKA-3017
URL: https://issues.apache.org/jira/browse/TIKA-3017
Project: Tika
Issue Type: Bug
Affects Versions: 1.19
Reporter: Don
Attachments: OOM_Slide_18.pptx
When Tika parses the attached PowerPoint slide, it throws an OutOfMemoryError
every time. The slide is a scrubbed slide from a Microsoft PowerPoint deck.
Unfortunately, I have no idea how the slide was created. When you open the
slide it looks totally blank; however, if you perform a Select All while it is
open in PowerPoint, you will see that it contains two items, one inside the
other. The person who created the slide deck is no longer available to give
details on how the slide was created. The two items appear to be text boxes,
but I am not sure this is the case, because if either one is removed and
replaced with a text box using MS PowerPoint, the OOM no longer happens. Also,
if the slide is opened in LibreOffice and then saved, the OOM does not happen.
There seems to be something specific about whatever these items really are and
how they were created.
The following is the stack trace of the OOM when the slide is parsed by Tika:
{noformat}
Executor task launch worker for task 47360
at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
at java.util.Arrays.copyOf([JI)[J (Arrays.java:3308)
at java.util.BitSet.ensureCapacity(I)V (BitSet.java:337)
at java.util.BitSet.expandTo(I)V (BitSet.java:352)
at java.util.BitSet.set(I)V (BitSet.java:447)
at org.apache.poi.xslf.usermodel.XSLFSheet.registerShapeId(I)V (XSLFSheet.java:123)
at org.apache.poi.xslf.usermodel.XSLFDrawing.<init>(Lorg/apache/poi/xslf/usermodel/XSLFSheet;Lorg/openxmlformats/schemas/presentationml/x2006/main/CTGroupShape;)V (XSLFDrawing.java:47)
at org.apache.poi.xslf.usermodel.XSLFSheet.initDrawingAndShapes()V (XSLFSheet.java:214)
at org.apache.poi.xslf.usermodel.XSLFSheet.getShapes()Ljava/util/List; (XSLFSheet.java:201)
at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.buildXHTML(Lorg/apache/tika/sax/XHTMLContentHandler;)V (XSLFPowerPointExtractorDecorator.java:110)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AbstractOOXMLExtractor.java:136)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OOXMLExtractorFactory.java:156)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(Ljava/io/InputStream;Lorg/xml/sax/ContentHandler;Lorg/apache/tika/metadata/Metadata;Lorg/apache/tika/parser/ParseContext;)V (AutoDetectParser.java:143)
at
{noformat}
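The top frames of the trace show the allocation failing inside
java.util.BitSet.set(int), called from XSLFSheet.registerShapeId(int). One
possible reading, and it is only an assumption since the shape IDs in the
attached slide have not been inspected, is that one of the two items carries a
very large shape ID. A BitSet has to allocate backing storage up to the highest
bit index it is asked to set, so a single large index costs memory in
proportion to that index. The sketch below is not POI code; the class and
method names are hypothetical and it only illustrates that sizing arithmetic
with the JDK.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of POI or Tika.
public class BitSetGrowth {

    // Number of 64-bit words a BitSet needs to address the given bit index.
    public static long wordsNeeded(int bitIndex) {
        return ((long) bitIndex >> 6) + 1;
    }

    // Minimum backing-array size in bytes for the given bit index.
    public static long bytesNeeded(int bitIndex) {
        return wordsNeeded(bitIndex) * Long.BYTES;
    }

    public static void main(String[] args) {
        // Setting one high bit forces storage for every bit below it.
        BitSet ids = new BitSet();
        ids.set(1_000_000);
        System.out.println("bits set:       " + ids.cardinality());
        System.out.println("bits allocated: " + ids.size());

        // Assumed worst case: an ID close to Integer.MAX_VALUE.
        // Computed only, not allocated, so this demo itself stays small.
        System.out.println("bytes for id 1_000_000:         " + bytesNeeded(1_000_000));
        System.out.println("bytes for id Integer.MAX_VALUE: " + bytesNeeded(Integer.MAX_VALUE));
    }
}
```

Under that assumption, an ID near Integer.MAX_VALUE would need roughly 256 MiB
for the bit set alone, and the old array is still live while Arrays.copyOf
builds the new one, which would be consistent with an executor that has a
modest heap running out of memory at exactly this point.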