[ https://issues.apache.org/jira/browse/TIKA-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705334#comment-15705334 ]
Tim Allison commented on TIKA-1321:
-----------------------------------
In a few weeks...well, maybe not.
The current dev implementation doesn't handle everything that our existing
extractor does, but it does handle some things that the existing extractor
doesn't.
The dev implementation uses beans for all parts other than document.xml and
the glossary document, and SAX for those two parts.
Wall clock sequential tests for our test suite's docx files (100 iterations):
Current: 25 seconds
Proposed: 16 seconds
Once we add "War and Peace" to our test suite's docx files (10 iterations):
Current: 89 seconds
Proposed: 15 seconds
These initial benchmarks suggest that a SAX/read-only docx extractor might be
worth the effort.
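For readers unfamiliar with the approach, here is a minimal sketch of what
"SAX for document.xml" can look like using only the JDK. It is not the Tika
code: the class name DocxSaxSketch, the helper buildSampleDocx, and the choice
to emit a newline per paragraph are all assumptions made for illustration, and
a real extractor has to handle far more (tables, hyperlinks, headers, footnotes,
and so on). The point is only that the main document part is read as a stream
of events, so no in-memory object model of the whole document is built.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Tika or POI.
public class DocxSaxSketch {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Appends the text of each w:t element and a newline at the end of each w:p. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(local)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(local)) {
                inText = false;
            } else if ("p".equals(local)) {
                out.append('\n');
            }
        }
    }

    /** Streams through the zip until word/document.xml is found, then SAX-parses it. */
    public static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    TextHandler handler = new TextHandler();
                    factory.newSAXParser().parse(zip, handler);
                    return handler.out.toString();
                }
            }
        }
        return ""; // no main document part found
    }

    /** Builds a tiny docx-like zip in memory; paragraph text must not contain XML markup. */
    public static byte[] buildSampleDocx(String... paragraphs) throws Exception {
        StringBuilder xml = new StringBuilder();
        xml.append("<w:document xmlns:w=\"").append(W_NS).append("\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = buildSampleDocx("Hello", "World");
        // Prints "Hello" and "World" on separate lines.
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

The sample archive built here contains only word/document.xml, so it is not a
valid docx file; it exists just to make the sketch runnable without a test
document on disk.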
> Add experimental SAX/Streaming XWPF/docx extractor
> --------------------------------------------------
>
> Key: TIKA-1321
> URL: https://issues.apache.org/jira/browse/TIKA-1321
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
>
> I'd like to contribute an experimental streaming extractor for docx. I
> should have something ready for committing in a few weeks. I'll attach
> drafts as they're ready.
> At least for a couple of releases, I'd like to keep it in
> o.a.t.parser.microsoft.ooxml.experimental if that makes sense.