[Tika Wiki] Update of "MSOfficeParsers" by TimothyAllison

Apache Wiki Wed, 05 Apr 2017 06:14:11 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=3&rev2=4

  = Tika's MSOffice Parsers (Apache POI) =
  
- == Experimental SAX Parser for .docx ==
+ == Experimental SAX Parser for .docx and .pptx ==
  
- As of Tika 1.15, there is an experimental SAX parser for .docx files.  On 
very large files (e.g. "War and Peace"), this parser appears to be 4x faster 
and require far less memory than our traditional DOM based parser.  For smaller 
files, the gain is not nearly as great, but it is still faster.  This parser is 
still in its early stages and doesn't have all of the features of the DOM 
parser.  However, it does offer parameterization to include or exclude deleted 
text.
+ As of Tika 1.15, there are experimental SAX parsers for .docx and .pptx 
files.  On very large files (e.g. "War and Peace"), the .docx parser appears to 
be about 4x faster and to require far less memory than our traditional 
DOM-based parser.  For smaller files, the gain is not nearly as great.  For the 
386MB .pptx submitted on TIKA-2201, the DOM parser would have needed ~60GB to 
load the file in memory.
  
+ These parsers are still in their early stages and don't have all of the 
features of the DOM parsers.  However, the .docx parser does offer 
parameterization to include or exclude deleted text.
+ 
- To select it programmatically, set `setUseSAXDocxExtractor` to `true` on an 
OfficeParserConfig and put that in the ParseContext: 
`context.set(OfficeParserConfig.class, officeParserConfig);`.
+ To select one programmatically, call `setUseSAXDocxExtractor(true)` or 
`setUseSAXPptxExtractor(true)` on an OfficeParserConfig and put that in the 
ParseContext: `context.set(OfficeParserConfig.class, officeParserConfig);`.
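
The programmatic route described above might look roughly like the following. This is a minimal sketch, assuming tika-parsers 1.15 or later is on the classpath; the class name `SaxOfficeExample`, the helper `saxContext()`, and taking the input path from `args[0]` are illustrative choices, not part of Tika.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; only the Tika types and setters come from Tika.
public class SaxOfficeExample {

    // Builds a ParseContext that asks the OOXML parser to use the
    // experimental SAX extractors for both .docx and .pptx.
    static ParseContext saxContext() {
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        officeParserConfig.setUseSAXPptxExtractor(true);

        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeParserConfig);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        // A write limit of -1 disables BodyContentHandler's default cap.
        BodyContentHandler handler = new BodyContentHandler(-1);
        // args[0] is assumed to be the path to a .docx or .pptx file.
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, new Metadata(), saxContext());
        }
        System.out.println(handler.toString());
    }
}
```

Files other than .docx and .pptx should be unaffected by these two flags, since they are only consulted by the OOXML parser.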
  
  To set it via the config file, try:
  
@@ -17, +19 @@

          <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
              <params>
                  <param name="useSAXDocxExtractor" type="bool">true</param>
+                 <param name="useSAXPptxExtractor" type="bool">true</param>
              </params>
          </parser>
      </parsers>
  </properties>
  }}}
  
- See [[https://issues.apache.org/jira/browse/TIKA-1321|TIKA-1321]] for the 
parser and [[https://issues.apache.org/jira/browse/TIKA-2180|TIKA-2180]] for 
some symptoms that the current DOM parser might be slowing you down.
+ See [[https://issues.apache.org/jira/browse/TIKA-1321|TIKA-1321]] for the 
parser and [[https://issues.apache.org/jira/browse/TIKA-2180|TIKA-2180]] and 
[[https://issues.apache.org/jira/browse/TIKA-2201|TIKA-2201]] for some symptoms 
that the current DOM parser might be slowing you down.
-  
+ 
+ 
+ 
  == How to build Tika with POI's trunk ==
  
  You'll need to have the following build tools installed: Ant, Forrest and 
Maven.
