I've added SAX parsers for pptx and docx over on Apache Tika.  These
rely on POI for OPCPackage, a bunch of other classes, and the overall
design.

I've thought about moving that code into POI, but I haven't found the
time or the need, and the code is my typical kludgy mess... and I don't
want to pollute POI any more than I already have.

Take a look over on Tika and see if those will work for you.  Let me
know what you think...
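
To give a feel for what "event based" means for docx, here is a minimal
sketch that uses only the JDK.  It is not the Tika code and it does not use
POI's OPCPackage; the class name DocxTextSketch and the method extractText
are made up for illustration.  It assumes a Transitional OOXML file whose
body lives in word/document.xml, and it only collects the text of w:t runs
with a newline per w:p paragraph (no tables, headers, footnotes, tabs or
breaks).

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of POI or Tika.
class DocxTextSketch {

    // WordprocessingML main namespace (Transitional OOXML).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams word/document.xml through SAX instead of building an object model.
    static String extractText(InputStream docx) throws Exception {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                // Shield the zip stream in case the parser closes its input.
                InputStream entryStream = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // intentionally left open; the try block closes the zip
                    }
                };
                factory.newSAXParser().parse(entryStream, new TextHandler(out));
                break;
            }
        }
        return out.toString();
    }

    // Appends the characters found inside w:t and a newline at the end of each w:p.
    private static final class TextHandler extends DefaultHandler {
        private final StringBuilder out;
        private boolean inText;

        TextHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(local)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(local)) {
                inText = false;
            } else if ("p".equals(local)) {
                out.append('\n');
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

The sketch still gathers the text into a StringBuilder for simplicity; for
very large files you would more likely write to a Writer or hand the events
on to your own handler, so that only the current chunk is held in memory.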

References:
https://wiki.apache.org/tika/MSOfficeParsers

https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java

https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java

On Thu, Feb 14, 2019 at 8:53 AM Kalam, Venkata Krishna Chaitanya
<vka...@informatica.com.invalid> wrote:
>
> Hi team
> We are trying to read data from Office documents such as xlsx, xls, docx,
> etc., but we are running into memory issues when reading large OOXML files
> (around 100 MB) using the POI APIs. For the xls/xlsx formats there are
> event-based APIs that solve the memory issue (the XSSF/HSSF event-based
> APIs), but for reading Word or ppt files there are no event-based APIs.
> We have to create an XWPF/HWPF document, which consumes a lot of memory; e.g.,
> for a 45 MB DOCX file, preparing the XWPFDocument takes 12 GB of heap.
>
> So, similar to xlsx files, is there any plan to provide event-based APIs for
> the rest of the Office document formats?
> Also, is there any workaround to read the data with less memory consumption?
> Please let me know. Our use case is just to read the data.
>
> Thanks
> Chaitanya

