I've added SAX parsers for pptx and docx over on Apache Tika. They rely on POI for OPCPackage, a number of other classes, and the overall design.
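To give a rough sense of what "SAX parsing a docx" means, here is a minimal sketch that uses only the JDK. It is not the Tika code and it does not use POI's OPCPackage; it assumes the main document part lives at word/document.xml and that only the text in w:t elements is wanted. The class and method names (DocxSaxSketch, extractText) are made up for illustration. The real Tika parsers handle much more (headers, footers, tables, comments, embedded files and so on).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of POI or Tika.
final class DocxSaxSketch {
    // WordprocessingML main namespace (the usual "w:" prefix).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams word/document.xml out of a .docx and returns the text of its w:t runs. */
    static String extractText(InputStream docx)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        // A real extractor would push text to a Writer or ContentHandler
        // instead of buffering it; a StringBuilder keeps the sketch short.
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue; // skip every other part of the package
                }
                // The XML is never loaded as a tree; only events are seen.
                factory.newSAXParser().parse(new InputSource(zip), new DefaultHandler() {
                    private boolean inText;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) {
                            inText = true;
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) {
                            out.append(ch, start, length);
                        }
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (!W_NS.equals(uri)) {
                            return;
                        }
                        if ("t".equals(local)) {
                            inText = false;
                        } else if ("p".equals(local)) {
                            out.append('\n'); // one line per paragraph
                        }
                    }
                });
                break; // the parser may have closed the stream; we are done anyway
            }
        }
        return out.toString();
    }
}
```

Calling DocxSaxSketch.extractText(new FileInputStream("some.docx")) should return the paragraph text, one paragraph per line. Since the parser only ever holds the current event, memory use stays roughly flat as the file grows, which is the point of the event-based approach.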
I've thought about moving that code into POI, but I haven't found the time or need, and the code is my typical kludgy mess... and I don't want to pollute POI any more than I have.

Take a look over on Tika and see if those will work for you. Let me know what you think...

References:
https://wiki.apache.org/tika/MSOfficeParsers
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java

On Thu, Feb 14, 2019 at 8:53 AM Kalam, Venkata Krishna Chaitanya <vka...@informatica.com.invalid> wrote:
>
> Hi team
> We are trying to read the data from office documents like xlsx, xls, docx,
> etc. But we are facing memory issues while reading large OOXML files
> (around 100 MB) using the POI APIs. For the xls/xlsx formats there are
> event-based APIs which solve the memory issue (the XSSF/HSSF event-based
> API). But for reading Word or PowerPoint files, there are no event-based
> APIs. We have to create an XWPF/HWPF document, which consumes a lot of
> memory. For example, for a 45 MB DOCX file, preparing the XWPFDocument
> takes 12 GB of heap.
>
> So, similar to xlsx files, is there any plan to provide event-based APIs
> for the rest of the office documents?
> And is there any workaround to read the data with less memory consumption?
> Please let me know. Our use case is just to read the data.
>
> Thanks
> Chaitanya