Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=6&rev2=7

  = Tika's MSOffice Parsers (Apache POI) =
  
- == Experimental SAX Parser for .docx and .pptx ==
+ == Beta SAX Parsers for .docx and .pptx ==
  
- As of Tika 1.15, there are experimental SAX parsers for .docx files. On very large files (e.g. "War and Peace"), this parser appears to be 4x faster and require far less memory than our traditional DOM based parsers. For smaller files, the gain is not nearly as great. For the 386MB pptx submitted on TIKA-2201, it would have taken ~60GB to load the file in memory.
+ As of Tika 1.15, there are experimental/beta SAX parsers for .docx files. On very large files (e.g. "War and Peace"), this parser appears to be 4x faster and require far less memory than our traditional DOM based parsers. For smaller files, the gain is not nearly as great. For the 386MB pptx submitted on TIKA-2201, it would have taken ~60GB to load the file in memory.
  
  These parsers are still in their early stages and don't have all of the features of the DOM parsers. However, the .docx parser does offer parameterization to include or exclude deleted text.
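
To illustrate why a SAX approach needs less memory than a DOM one, and what "include or exclude deleted text" can mean in practice, here is a minimal sketch that uses only the JDK's built-in SAX parser. It is not Tika's or POI's implementation. The class name `DocxTextSketch` and the method `extractText` are hypothetical, and the sketch assumes it is handed the already-unzipped contents of `word/document.xml` as a string. The only facts it relies on are that WordprocessingML stores regular run text in `w:t` elements and tracked deletions in `w:delText` elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only; not part of Tika or POI.
final class DocxTextSketch {

    // Namespace of the WordprocessingML main document part.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams over the XML once and keeps only the extracted text in memory,
    // rather than building a tree of the whole document first.
    static String extractText(String documentXml, boolean includeDeleted)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean capture; // true while inside w:t (or w:delText)

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                // Regular text is always captured; deleted text only on request.
                capture = "t".equals(localName)
                        || (includeDeleted && "delText".equals(localName));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName) || "delText".equals(localName)) {
                    capture = false;
                } else if ("p".equals(localName)) {
                    out.append('\n'); // one output line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capture) {
                    out.append(ch, start, length);
                }
            }
        };

        byte[] bytes = documentXml.getBytes(StandardCharsets.UTF_8);
        factory.newSAXParser().parse(new ByteArrayInputStream(bytes), handler);
        return out.toString();
    }
}
```

Calling `DocxTextSketch.extractText(xml, false)` skips anything inside `w:delText`, while passing `true` keeps it in document order. A real parser also has to handle tables, headers, footnotes, embedded files and more, which is part of why the SAX parsers described above do not yet match the DOM parsers feature for feature.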
