https://issues.apache.org/bugzilla/show_bug.cgi?id=57463
Bug ID: 57463
Summary: OutOfMemoryError while extracting text from DOCX files
Product: POI
Version: 3.11-FINAL
Hardware: PC
OS: All
Status: NEW
Severity: blocker
Priority: P2
Component: POI Overall
Assignee: [email protected]
Reporter: [email protected]
Tika/POI text extraction for Lucene indexing frequently crashes server
processes because of excessive memory requirements.
For example, the document (smaller than 10 MB)
https://www.eba.europa.eu/documents/10180/359626/Annex+XIV_Data+point+definition_COREP.docx
requires about 3.5 GB of main memory for text extraction.
Analysis of the heap dump shows that large numbers of XMLBeans
objects are held in memory:
Class Name                                      | Objects   | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------
org.apache.xmlbeans.impl.store.Xobj$ElementXobj | 2.763.489 | 265.294.944  | >= 511.530.832
org.apache.xmlbeans.impl.store.Xobj$AttrXobj    | 2.797.953 | 246.219.864  | >= 246.233.144
-------------------------------------------------------------------------------------------
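As a rough reading aid for the heap figures, the following sketch divides shallow heap by object count for each class and sums the two retained-heap lower bounds. It assumes the values are bytes and that the dots are thousands separators; the class and method names are hypothetical and exist only for this illustration.

```java
/** Hypothetical helper for reading the heap figures; not part of POI or XMLBeans. */
public class HeapTableMath {

    /** Average shallow size of one object, assuming both inputs are as reported. */
    static long bytesPerObject(long shallowHeapBytes, long objectCount) {
        return shallowHeapBytes / objectCount;
    }

    /** Lower bound on memory retained by both classes together. */
    static long retainedLowerBound(long elementRetained, long attrRetained) {
        return elementRetained + attrRetained;
    }

    // Figures copied from the report:
    //   bytesPerObject(265_294_944L, 2_763_489L)          -> 96  (ElementXobj)
    //   bytesPerObject(246_219_864L, 2_797_953L)          -> 88  (AttrXobj)
    //   retainedLowerBound(511_530_832L, 246_233_144L)    -> 757_763_976
}
```

Under these assumptions, each XML element costs 96 bytes and each attribute 88 bytes of shallow heap, before counting the strings and other objects they reference.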
The stack trace extracted from the heap dump was:
"QuartzScheduler_Worker-3" daemon prio=5 tid=24 RUNNABLE
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseCdataLiteral(PiccoloLexer.java:3027)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseQuotedTagValue(PiccoloLexer.java:2936)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1754)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
Local Variable: int[]#4831
Local Variable: int[]#4833
Local Variable: byte[]#1722
Local Variable: org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer#1
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
Local Variable: org.apache.xmlbeans.impl.piccolo.xml.Piccolo#1
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3454)
Local Variable: org.xml.sax.InputSource#1
Local Variable: org.apache.xmlbeans.impl.store.Locale$PiccoloSaxLoader#1
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1276)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1263)
Local Variable: org.apache.xmlbeans.impl.store.Locale#3
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeLoaderImpl#1
Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeImpl#89
at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(<unknown string>)
Local Variable: java.util.zip.ZipFile$1#1
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
Local Variable: java.util.HashMap#24338
Local Variable: org.apache.poi.xwpf.usermodel.XWPFFactory#1
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
Local Variable: org.apache.poi.xwpf.usermodel.XWPFDocument#1
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
Local Variable: org.apache.poi.xwpf.extractor.XWPFWordExtractor#1
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation[]#1
Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation#4
Local Variable: org.apache.poi.openxml4j.opc.PackageRelationshipCollection#1
Local Variable: org.apache.poi.openxml4j.opc.ZipPackagePart#1
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
Local Variable: org.apache.poi.openxml4j.opc.ZipPackage#1
Local Variable: java.util.Locale#1
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
Local Variable: org.apache.tika.sax.TaggedContentHandler#1
Local Variable: org.apache.tika.io.TemporaryResources#1
Local Variable: org.apache.tika.parser.microsoft.ooxml.OOXMLParser#2
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
Local Variable: org.apache.tika.sax.TaggedContentHandler#2
Local Variable: org.apache.tika.parser.DefaultParser#2
Local Variable: org.apache.tika.io.TemporaryResources#2
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
Local Variable: org.apache.tika.io.TikaInputStream#1
Local Variable: org.apache.tika.sax.SecureContentHandler#1
Local Variable: org.apache.tika.parser.AutoDetectParser#2
Local Variable: org.apache.tika.sax.BodyContentHandler#1
Local Variable: org.apache.tika.io.TemporaryResources#3
Local Variable: org.apache.tika.mime.MediaType#1153
at org.apache.tika.Tika.parseToString(Tika.java:380)
Local Variable: org.apache.tika.parser.ParseContext#1
Local Variable: org.apache.tika.metadata.Metadata#1
Local Variable: java.io.FileInputStream#3
Local Variable: org.apache.tika.Tika#2
Local Variable: org.apache.tika.sax.WriteOutContentHandler#1
at ...
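The trace shows the whole main document part being parsed into an XMLBeans tree (DocumentDocument$Factory.parse) before any text is extracted, which is where the Xobj instances accumulate. For contrast, here is a minimal sketch of a streaming approach that uses only the JDK (java.util.zip and StAX) and never builds a tree. It is not how POI or Tika work. The class StreamingDocxText is hypothetical, and the sketch assumes the main part is stored as word/document.xml (the usual location, though the package relationships formally decide it). It also ignores headers, footnotes, and all formatting.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not part of POI or Tika. */
public class StreamingDocxText {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of all w:t runs, with a newline after each w:p paragraph. */
    public static String extractText(InputStream docx)
            throws IOException, XMLStreamException {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part lives at the conventional path.
                if ("word/document.xml".equals(entry.getName())) {
                    appendBodyText(zip, out);
                    break;
                }
            }
        }
        return out.toString();
    }

    private static void appendBodyText(InputStream xml, StringBuilder out)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities for untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        boolean inText = false;
        try {
            // Pull events one at a time; only the extracted text is retained.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (isW(reader, "t")) inText = true;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (isW(reader, "t")) inText = false;
                    else if (isW(reader, "p")) out.append('\n');
                } else if (inText && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    out.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
    }

    private static boolean isW(XMLStreamReader r, String localName) {
        return W_NS.equals(r.getNamespaceURI()) && localName.equals(r.getLocalName());
    }
}
```

A caller would pass a FileInputStream for the DOCX file to extractText. Memory use then grows with the amount of extracted text rather than with the number of XML elements and attributes.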