https://issues.apache.org/bugzilla/show_bug.cgi?id=57463

            Bug ID: 57463
           Summary: OutOfMemoryError while extracting text from DOCX
                    files
           Product: POI
           Version: 3.11-FINAL
          Hardware: PC
                OS: All
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: POI Overall
          Assignee: [email protected]
          Reporter: [email protected]

Tika/POI text extraction for Lucene indexing quite often crashes server
processes due to excessive memory requirements.

For example, the document (under 10 MB) at
https://www.eba.europa.eu/documents/10180/359626/Annex+XIV_Data+point+definition_COREP.docx
requires about 3.5 GB of main memory for text extraction.

Analysis of the heap dump shows that a large number of XMLBeans objects
are held in memory.

Class Name                                      |   Objects | Shallow Heap | Retained Heap
--------------------------------------------------------------------------------------------
org.apache.xmlbeans.impl.store.Xobj$ElementXobj | 2.763.489 |  265.294.944 | >= 511.530.832
org.apache.xmlbeans.impl.store.Xobj$AttrXobj    | 2.797.953 |  246.219.864 | >= 246.233.144
--------------------------------------------------------------------------------------------
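
As a quick sanity check on the table above, the figures can be divided out to get the average shallow size per object. The sketch below only does arithmetic on the numbers from the heap dump; the class and method names (XobjHeapFigures, bytesPerObject, combinedShallow) are made up for illustration.

```java
public class XobjHeapFigures {
    // Object counts and shallow heap sizes (bytes) copied from the table above.
    static final long ELEMENT_COUNT = 2_763_489L;
    static final long ELEMENT_SHALLOW = 265_294_944L;
    static final long ATTR_COUNT = 2_797_953L;
    static final long ATTR_SHALLOW = 246_219_864L;

    /** Average shallow size of one object, in bytes (integer division). */
    static long bytesPerObject(long shallowBytes, long objectCount) {
        return shallowBytes / objectCount;
    }

    /** Combined shallow heap of both classes, in bytes. */
    static long combinedShallow() {
        return ELEMENT_SHALLOW + ATTR_SHALLOW;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerObject(ELEMENT_SHALLOW, ELEMENT_COUNT)); // 96
        System.out.println(bytesPerObject(ATTR_SHALLOW, ATTR_COUNT));       // 88
        System.out.println(combinedShallow());                              // 511514808
    }
}
```

So each ElementXobj costs 96 bytes and each AttrXobj 88 bytes of shallow heap. The combined shallow size (511.514.808 bytes) is within about 16 KB of the ElementXobj retained heap lower bound, which suggests, although the dump does not say so directly, that the AttrXobj instances are already counted inside the ElementXobj retained size and the two retained figures should not be added together.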


The stack extracted from the heap dump was:


"QuartzScheduler_Worker-3" daemon prio=5 tid=24 RUNNABLE
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseCdataLiteral(PiccoloLexer.java:3027)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseQuotedTagValue(PiccoloLexer.java:2936)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1754)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
       Local Variable: int[]#4831
       Local Variable: int[]#4833
       Local Variable: byte[]#1722
       Local Variable: org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer#1
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
       Local Variable: org.apache.xmlbeans.impl.piccolo.xml.Piccolo#1
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3454)
       Local Variable: org.xml.sax.InputSource#1
       Local Variable: org.apache.xmlbeans.impl.store.Locale$PiccoloSaxLoader#1
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1276)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1263)
       Local Variable: org.apache.xmlbeans.impl.store.Locale#3
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
       Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeLoaderImpl#1
       Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeImpl#89
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(<unknown string>)
       Local Variable: java.util.zip.ZipFile$1#1
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
       Local Variable: java.util.HashMap#24338
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFFactory#1
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFDocument#1
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
       Local Variable: org.apache.poi.xwpf.extractor.XWPFWordExtractor#1
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation[]#1
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation#4
       Local Variable: org.apache.poi.openxml4j.opc.PackageRelationshipCollection#1
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackagePart#1
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackage#1
       Local Variable: java.util.Locale#1
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
       Local Variable: org.apache.tika.sax.TaggedContentHandler#1
       Local Variable: org.apache.tika.io.TemporaryResources#1
       Local Variable: org.apache.tika.parser.microsoft.ooxml.OOXMLParser#2
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
       Local Variable: org.apache.tika.sax.TaggedContentHandler#2
       Local Variable: org.apache.tika.parser.DefaultParser#2
       Local Variable: org.apache.tika.io.TemporaryResources#2
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
       Local Variable: org.apache.tika.io.TikaInputStream#1
       Local Variable: org.apache.tika.sax.SecureContentHandler#1
       Local Variable: org.apache.tika.parser.AutoDetectParser#2
       Local Variable: org.apache.tika.sax.BodyContentHandler#1
       Local Variable: org.apache.tika.io.TemporaryResources#3
       Local Variable: org.apache.tika.mime.MediaType#1153
    at org.apache.tika.Tika.parseToString(Tika.java:380)
       Local Variable: org.apache.tika.parser.ParseContext#1
       Local Variable: org.apache.tika.metadata.Metadata#1
       Local Variable: java.io.FileInputStream#3
       Local Variable: org.apache.tika.Tika#2
       Local Variable: org.apache.tika.sax.WriteOutContentHandler#1
    at ...
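
The trace shows XWPFDocument.onDocumentRead handing the whole of word/document.xml to DocumentDocument$Factory.parse, which is where the XMLBeans tree is built. For comparison, here is a minimal sketch of a streaming extraction that never materialises such a tree. It is not how POI 3.11 or Tika extract text; it uses only JDK classes (java.util.zip and StAX), assumes the main body lives in word/document.xml, and ignores tables, headers, footnotes and everything else a real extractor handles. The class name StreamingDocxText and the helper tinyDocx are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingDocxText {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Collects the text of all w:t elements, one output line per w:p paragraph. */
    static String extract(InputStream docx) throws IOException, XMLStreamException {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    readBody(zip, out);
                    break;
                }
            }
        }
        return out.toString();
    }

    /** Pulls StAX events one at a time; only the current event is held by the parser. */
    private static void readBody(InputStream xmlBytes, StringBuilder out)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader xml = factory.createXMLStreamReader(xmlBytes);
        boolean inText = false;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (isWordElement(xml, "t")) inText = true;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (isWordElement(xml, "t")) inText = false;
                else if (isWordElement(xml, "p")) out.append('\n');
            } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                out.append(xml.getText());
            }
        }
        xml.close();
    }

    private static boolean isWordElement(XMLStreamReader xml, String localName) {
        return W_NS.equals(xml.getNamespaceURI()) && localName.equals(xml.getLocalName());
    }

    /** Test helper (hypothetical): wraps the given XML as word/document.xml in a zip. */
    static byte[] tinyDocx(String documentXml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>";
        System.out.print(extract(new ByteArrayInputStream(tinyDocx(xml))));
    }
}
```

The sketch still accumulates the extracted text in a StringBuilder; for indexing, the same loop could write to a Writer instead, so that memory use depends on the parser's buffers rather than on the number of elements and attributes in the document.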
