This is very helpful, Mark. Thank you. Yes, I'd add handling of the glossary document, as well.
As I was working on the SAX parser for Tika, it "feels" more robust from an extraction standpoint because it extracts every "w:t", with a few exceptions (delText, moveFrom, AlternateContent, etc.). It still needs more work, but from the list you've compiled it sounds like the new parser might not be a bad idea...if the sole goal is extraction.

-----Original Message-----
From: Murphy, Mark [mailto:murphym...@metalexmfg.com]
Sent: Monday, December 12, 2016 3:56 PM
To: 'POI Developers List' <dev@poi.apache.org>
Subject: RE: got docx?

Lol, just from looking through the code, and standard, there are a number of things that I know are not handled or not handled properly in XWPF. A quick subset from the top of my head includes:

* Pictures that are not inlined in the main document, header, or footer parts.
* Sections
* SDT content
* Alternate content
* Many of the shared portions of the spec
* Tables have problems
* Versions - This is a tag that gets added to every node telling which save (version) it was created for.
* Revisions - This is the stuff that tells what was changed and how. Which nodes were inserted, or changed, or deleted, or moved, and when, and by whom.

There are thousands of hours left just to get it to version1 of the spec. But yes, thanks Dominik for providing this batch of test documents. It should help prioritize fixes.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, December 12, 2016 9:58 AM
To: POI Developers List <dev@poi.apache.org>
Cc: d...@tika.apache.org
Subject: RE: got docx?

To close the loop and share my gratitude publicly...

Thank you, Dominik, for transferring 41k, 5GB of docx/dotx to our regression corpus! I’ve already found a number of “areas for improvement” in Tika's experimental docx SAX parser, and a few areas for improvement in POI's XWPFDocument/DOM parser…all thanks to your documents and your common crawl code.

Thank you!
Cheers,
Tim
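
For readers following the thread, the "extract every w:t, with a few exceptions" idea from the top reply can be sketched roughly as below. This is only a minimal illustration built on the JDK's SAX API, not Tika's or POI's actual code: the class name WtTextSketch and the method extract are hypothetical, and the choice to drop everything under w:moveFrom and mc:Fallback is an assumption about one reasonable way to avoid duplicated text. w:delText is skipped simply because its local name is not "t".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collect the text of every w:t element, ignoring
// w:delText and anything nested inside w:moveFrom or mc:Fallback.
final class WtTextSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Assumption: these two containers hold content we do not want to emit.
    private static boolean isSkipped(String uri, String local) {
        return (W_NS.equals(uri) && "moveFrom".equals(local))
            || (MC_NS.equals(uri) && "Fallback".equals(local));
    }

    private static final class Handler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int skipDepth = 0;      // > 0 while inside a skipped container
        private boolean inText = false; // true while inside an accepted w:t

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (isSkipped(uri, local)) {
                skipDepth++;
            } else if (W_NS.equals(uri) && "t".equals(local) && skipDepth == 0) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (isSkipped(uri, local)) {
                skipDepth--;
            } else if (W_NS.equals(uri) && "t".equals(local)) {
                inText = false;
            } else if (W_NS.equals(uri) && "p".equals(local) && skipDepth == 0) {
                out.append('\n'); // one line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }
    }

    // Parses a WordprocessingML part (e.g. the content of word/document.xml).
    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to get uri/local names
            Handler handler = new Handler();
            factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document part", e);
        }
    }
}
```

The same handler could in principle be pointed at the header, footer, or glossary parts, since it only looks at element names; which parts to feed it is left open here.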