[jira] [Commented] (TIKA-1321) Add experimental SAX/Streaming XWPF/docx extractor

Tim Allison (JIRA) Tue, 29 Nov 2016 05:44:17 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705334#comment-15705334
 ]


Tim Allison commented on TIKA-1321:
-----------------------------------

In a few weeks...well, maybe not.

The dev implementation doesn't yet handle everything that our current 
extractor does, but it does handle some things that the current extractor 
doesn't.

The dev implementation still uses XMLBeans for all parts other than 
document.xml and the glossary document, and uses SAX for those two.
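
To illustrate the SAX half of that split, here is a minimal sketch (not the 
actual patch) of streaming text out of a docx main part with only the JDK. 
The class name DocxSaxSketch and its methods are hypothetical, and it assumes 
the main part lives at word/document.xml, which is typical but is really 
determined by the package relationships. It collects only w:t run text and 
ends a line at each w:p; tables, headers, notes, etc. are ignored.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the extractor attached to TIKA-1321.
public class DocxSaxSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams WordprocessingML and collects run text, one line per paragraph.
    static String extractText(InputStream documentXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        StringBuilder out = new StringBuilder();
        parser.parse(documentXml, new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // <w:t> holds the literal text of a run
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    out.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    // Assumption: the main document part is at word/document.xml.
    static String extractFromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractText(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java DocxSaxSketch file.docx");
            return;
        }
        System.out.print(extractFromDocx(args[0]));
    }
}
```

Nothing here builds an object tree for the body, so memory use stays roughly 
flat as the document grows; that is the property a read-only SAX extractor is 
after.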

Wall-clock times, run sequentially over our test suite's docx files (100 iterations):
Current: 25 seconds
Proposed: 16 seconds

After adding "War and Peace" to our test suite's docx files (10 iterations):
Current: 89 seconds
Proposed: 15 seconds

These initial benchmarks suggest that a SAX/read-only docx extractor might be 
worth the effort.

> Add experimental SAX/Streaming XWPF/docx extractor
> --------------------------------------------------
>
>                 Key: TIKA-1321
>                 URL: https://issues.apache.org/jira/browse/TIKA-1321
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>
> I'd like to contribute an experimental streaming extractor for docx.  I 
> should have something ready for committing in a few weeks.  I'll attach 
> drafts as they're ready.
> At least for a couple of releases, I'd like to keep it in 
> o.a.t.parser.microsoft.ooxml.experimental if that makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
