Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=7&rev2=8

  
  == Beta SAX Parsers for .docx and .pptx ==
  
- As of Tika 1.15, there are experimental/beta SAX parsers for .docx files.  On 
very large files (e.g. "War and Peace"), this parser appears to be 4x faster 
and require far less memory than our traditional DOM-based parsers.  For 
smaller files, the gain is not nearly as great.  For the 386MB pptx submitted 
on TIKA-2201, it would have taken ~60GB to load the file in memory.
+ As of Tika 1.15, there are experimental/beta SAX parsers for .docx files.  On 
very large files (e.g. "War and Peace"), this parser appears to be 4x faster 
and require far less memory than our traditional DOM-based parsers.  For 
smaller files, the gain is not nearly as great.  For the 386MB pptx submitted 
on TIKA-2201, it would have taken ~60GB to load the file in memory.  See also 
the DOCX file submitted on TIKA-2847, which is only 3.6MB, but decompresses to 
~100MB.
  
  These parsers are still in their early stages and don't have all of the 
features of the DOM parsers.  However, the .docx parser does offer 
parameterization to include or exclude deleted text.
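
  As a rough illustration of the streaming approach and of the include/exclude
deleted text option, here is a minimal sketch using only the JDK's SAX API.  It
is not Tika's implementation: the class name DocxSaxSketch, the method
extractText and the includeDeleted flag are hypothetical, and the sketch assumes
the caller has already unzipped word/document.xml from the .docx into a string.
It relies only on the WordprocessingML convention that run text lives in w:t
elements and tracked-deletion text in w:delText elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: stream text out of WordprocessingML
// without building a DOM tree, optionally keeping tracked deletions.
class DocxSaxSketch {

    // Main WordprocessingML namespace (the "w:" prefix in word/document.xml).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(String documentXml, boolean includeDeleted)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on uri + localName
        SAXParser parser = factory.newSAXParser();
        StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // True while we are inside an element whose text we want to keep.
            private boolean capture = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    capture = true;            // ordinary run text
                } else if ("delText".equals(localName)) {
                    capture = includeDeleted;  // tracked-deletion text
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (W_NS.equals(uri)
                        && ("t".equals(localName) || "delText".equals(localName))) {
                    capture = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the current chunk of characters is in memory here.
                if (capture) {
                    out.append(ch, start, length);
                }
            }
        };

        byte[] bytes = documentXml.getBytes(StandardCharsets.UTF_8);
        parser.parse(new ByteArrayInputStream(bytes), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
            + "<w:del w:id=\"1\" w:author=\"a\"><w:r>"
            + "<w:delText>cruel </w:delText></w:r></w:del>"
            + "<w:r><w:t>world</w:t></w:r>"
            + "</w:p></w:body></w:document>";
        System.out.println(extractText(xml, false)); // Hello world
        System.out.println(extractText(xml, true));  // Hello cruel world
    }
}
```

  Because the handler only ever sees one event at a time, memory use stays
roughly proportional to the extracted text rather than to the full XML tree,
which is the general reason a SAX parser can cope with files that decompress to
far more than their size on disk.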
  
