Lol, just from looking through the code and the standard, there are a number 
of things that I know are not handled, or not handled properly, in XWPF. A 
quick subset off the top of my head includes:
* Pictures that are not inlined in the main document, header, or footer parts.
* Sections
* SDT content
* Alternate content
* Many of the shared portions of the spec
* Tables have problems
* Versions - This is a tag that gets added to every node telling which save 
(version) it was created for.
* Revisions - This is the stuff that tells what was changed and how: which 
nodes were inserted, changed, deleted, or moved, and when, and by whom.
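
To make the revisions item concrete, here is a minimal sketch (not POI code, and not a statement of how XWPF does or should handle it) that tallies tracked-change markup in a .docx using only the Python standard library. It assumes the revisions are stored as the WordprocessingML elements w:ins, w:del, w:moveFrom and w:moveTo, each carrying a w:author attribute, and it only looks at the main document part, word/document.xml. The function name summarize_revisions is made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Revision elements considered by this sketch; other revision markup
# (formatting changes, for example) is deliberately ignored here.
REVISION_TAGS = ("ins", "del", "moveFrom", "moveTo")


def summarize_revisions(docx_bytes):
    """Return a Counter keyed by (tag, author) for tracked-change elements.

    Hypothetical helper: only word/document.xml is inspected, so headers,
    footers, footnotes and comments are not counted.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    counts = Counter()
    for tag in REVISION_TAGS:
        for el in root.iter("{%s}%s" % (W_NS, tag)):
            # Fall back to "unknown" when no w:author attribute is present.
            author = el.get("{%s}author" % W_NS, "unknown")
            counts[(tag, author)] += 1
    return counts
```

Running something like this over a corpus gives a rough idea of how many documents carry revision markup at all, which could help decide how much priority that item deserves.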

There are thousands of hours left just to get it to version 1 of the spec.

But yes, thanks Dominik for providing this batch of test documents. It should 
help prioritize fixes.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, December 12, 2016 9:58 AM
To: POI Developers List <dev@poi.apache.org>
Cc: d...@tika.apache.org
Subject: RE: got docx?

To close the loop and share my gratitude publicly...

Thank you, Dominik, for transferring 41k, 5GB of docx/dotx to our regression 
corpus!

I’ve already found a number of “areas for improvement” in Tika's experimental 
docx SAX parser, and a few areas for improvement in POI's XWPFDocument/DOM 
parser…all thanks to your documents and your common crawl code.  

Thank you!


Cheers,

        Tim

