Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=6&rev2=7

  = Tika's MSOffice Parsers (Apache POI) =
  
- == Experimental SAX Parser for .docx and .pptx ==
+ == Beta SAX Parsers for .docx and .pptx ==
  
- As of Tika 1.15, there are experimental SAX parsers for .docx files.  On very 
large files (e.g. "War and Peace"), this parser appears to be 4x faster and 
require far less memory than our traditional DOM based parsers.  For smaller 
files, the gain is not nearly as great.  For the 386MB pptx submitted on 
TIKA-2201, it would have taken ~60GB to load the file in memory.
+ As of Tika 1.15, there are experimental/beta SAX parsers for .docx files.  On 
very large files (e.g. "War and Peace"), this parser appears to be 4x faster 
and require far less memory than our traditional DOM based parsers.  For 
smaller files, the gain is not nearly as great.  For the 386MB pptx submitted 
on TIKA-2201, it would have taken ~60GB to load the file in memory.
  
  These parsers are still in their early stages and don't have all of the 
features of the DOM parsers.  However, the .docx parser does offer 
parameterization to include or exclude deleted text.
  
