[ https://issues.apache.org/jira/browse/TIKA-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058086#comment-18058086 ]
Tilman Hausherr commented on TIKA-4657:
---------------------------------------
I was able to get the missing endnote content by enabling the
SAXDocxExtractor. I can't tell whether this is a bug or not, but at least now
we know it's not a POI bug. The information does exist.
{code:java}
// getXML(), XMLResult and AUTO_DETECT_PARSER are inherited from Tika's TikaTest base class.
@Test
public void testTIKA4657() throws Exception {
    ParseContext parseContext = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    // Left at their defaults:
    // officeParserConfig.setIncludeHeadersAndFooters(true);
    // officeParserConfig.setIncludeMissingRows(true);
    // officeParserConfig.setExtractMacros(true);
    // officeParserConfig.setIncludeMoveFromContent(true);
    // officeParserConfig.setIncludeShapeBasedContent(true);
    // The only setting changed: switch from the default DOM-based extractor to the SAX one.
    officeParserConfig.setUseSAXDocxExtractor(true);
    parseContext.set(OfficeParserConfig.class, officeParserConfig);

    Metadata metadata = new Metadata();
    XMLResult result;
    // try-with-resources so the file handle is released even if parsing fails
    try (InputStream is = new FileInputStream("XXXXX/with_table_-_endnote_content_omitted.docx")) {
        result = getXML(is, AUTO_DETECT_PARSER, metadata, parseContext);
    }
    String xml = result.xml;
    metadata = result.metadata;

    // Dump the XHTML and metadata for manual inspection.
    System.out.println(xml);
    System.out.println("");
    System.out.println(metadata);
    System.out.println("");
    assertEquals("application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            metadata.get(Metadata.CONTENT_TYPE));
}
{code}
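For anyone who wants to double-check that the text really is in the file without going through Tika or POI at all, here is a rough, JDK-only sketch. It assumes the endnotes are stored in the conventional part {{word/endnotes.xml}} (strictly speaking the part name comes from the package relationships, so this is a shortcut), and the class and method names ({{EndnoteTableText}}, {{tableTextInEndnotes}}) are made up for illustration. It lists every {{w:t}} run that has a {{w:tbl}} ancestor in that part.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class EndnoteTableText {
    // WordprocessingML main namespace (the "w:" prefix in document parts).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Assumption: conventional part name; not resolved via the package relationships.
    static final String ENDNOTES_PART = "word/endnotes.xml";

    /** Returns the text of every w:t element in the endnotes part that has a w:tbl ancestor. */
    static List<String> tableTextInEndnotes(InputStream docx) throws Exception {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!ENDNOTES_PART.equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                dbf.setNamespaceAware(true);
                // Refuse DOCTYPE declarations so an untrusted file cannot trigger XXE.
                dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                // readAllBytes() stops at the end of the current zip entry.
                Document doc = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList runs = doc.getElementsByTagNameNS(W_NS, "t");
                for (int i = 0; i < runs.getLength(); i++) {
                    if (insideTable(runs.item(i))) {
                        texts.add(runs.item(i).getTextContent());
                    }
                }
            }
        }
        return texts;
    }

    /** Walks up the parent chain looking for a w:tbl element. */
    static boolean insideTable(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (W_NS.equals(p.getNamespaceURI()) && "tbl".equals(p.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java EndnoteTableText <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            tableTextInEndnotes(in).forEach(System.out::println);
        }
    }
}
```

Run it with the path to the attached with_table file as the only argument. If it prints the table text, the content is present in the package, independently of which extractor is used.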
> Endnote content in tables omitted from .docx text
> -------------------------------------------------
>
> Key: TIKA-4657
> URL: https://issues.apache.org/jira/browse/TIKA-4657
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.2.3
> Reporter: Klara Mazurak
> Priority: Major
> Attachments: with_table_-_endnote_content_omitted.docx,
> without_table_-_endnotes_work_correctly.docx
>
>
> If an endnote in a .docx file contains text in a table, that text is omitted
> from Tika's text extraction.
> See the two attached files: the one without a table yields all the text as
> expected, the one with the table does not.