This is very helpful, Mark.  Thank you.  Yes, I'd add handling of the glossary 
document as well.

While working on the SAX parser for Tika, I've found that it "feels" more 
robust from an extraction standpoint because it extracts every "w:t" element, 
with a few exceptions (delText, moveFrom, AlternateContent, etc.).  It still 
needs more work, but judging from the list you've compiled, the new parser 
might not be a bad idea...if the sole goal is extraction.
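
To make the "extract every w:t, with a few exceptions" idea concrete, here is 
a minimal sketch using only the JDK's SAX API. It is not Tika's actual code: 
the class name WTextSketch is hypothetical, and skipping mc:Fallback (so that 
AlternateContent text is not emitted twice) is an assumption on my part. 
w:delText is dropped simply because it is not a w:t element; w:moveFrom is 
skipped as a whole subtree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects the text of w:t elements from WordprocessingML.
class WTextSketch extends DefaultHandler {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0;      // > 0 while inside a skipped subtree
    private boolean inText = false; // true while inside a w:t element

    // Assumption: skip moved-away content and the Fallback branch only.
    private static boolean isSkipped(String uri, String localName) {
        return (W_NS.equals(uri) && "moveFrom".equals(localName))
                || (MC_NS.equals(uri) && "Fallback".equals(localName));
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (skipDepth > 0 || isSkipped(uri, localName)) {
            skipDepth++; // track nesting so we know when the subtree ends
            return;
        }
        if (W_NS.equals(uri) && "t".equals(localName)) {
            inText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        if (W_NS.equals(uri) && "t".equals(localName)) {
            inText = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            text.append(ch, start, length);
        }
    }

    // Parses an XML string and returns the concatenated w:t text.
    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for uri/localName matching
            WTextSketch handler = new WTextSketch();
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Calling WTextSketch.extract(...) on the XML of a document part returns the 
surviving run text as one string. A real extractor would also need paragraph 
breaks, tabs, headers/footers, and so on, which this sketch leaves out.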



-----Original Message-----
From: Murphy, Mark [mailto:murphym...@metalexmfg.com] 
Sent: Monday, December 12, 2016 3:56 PM
To: 'POI Developers List' <dev@poi.apache.org>
Subject: RE: got docx?

Lol, just from looking through the code and the standard, there are a number 
of things that I know are not handled, or not handled properly, in XWPF. A 
quick subset off the top of my head includes:
* Pictures that are not inlined in the main document, header, or footer parts.
* Sections
* SDT content
* Alternate content
* Many of the shared portions of the spec
* Tables have problems
* Versions - This is a tag that gets added to every node telling which save 
(version) it was created for.
* Revisions - This is the stuff that tells what was changed and how. Which 
nodes were inserted, or changed, or deleted, or moved, and when, and by whom.

There are thousands of hours left just to get it to version 1 of the spec.

But yes, thanks Dominik for providing this batch of test documents. It should 
help prioritize fixes.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, December 12, 2016 9:58 AM
To: POI Developers List <dev@poi.apache.org>
Cc: d...@tika.apache.org
Subject: RE: got docx?

To close the loop and share my gratitude publicly...

Thank you, Dominik, for transferring 41k docx/dotx files (5GB) to our 
regression corpus!

I’ve already found a number of “areas for improvement” in Tika's experimental 
docx SAX parser, and a few areas for improvement in POI's XWPFDocument/DOM 
parser…all thanks to your documents and your Common Crawl code.

Thank you!


Cheers,

        Tim

