[ 
https://issues.apache.org/jira/browse/TIKA-4657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058086#comment-18058086
 ] 

Tilman Hausherr commented on TIKA-4657:
---------------------------------------

I was able to extract that content by using the SAXDocxExtractor. I can't tell 
whether this is a bug, but at least we now know it's not a POI bug: the 
information does exist.

{code:java}
    @Test
    public void testTIKA4657() throws Exception {

        ParseContext parseContext = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
//        officeParserConfig.setIncludeHeadersAndFooters(true);
//        officeParserConfig.setIncludeMissingRows(true);
//        officeParserConfig.setExtractMacros(true);
//        officeParserConfig.setIncludeMoveFromContent(true);
//        officeParserConfig.setIncludeShapeBasedContent(true);
        officeParserConfig.setUseSAXDocxExtractor(true);
        parseContext.set(OfficeParserConfig.class, officeParserConfig);
        Metadata metadata = new Metadata();

        // try-with-resources closes the stream; "XXXXX" is the elided local directory.
        try (InputStream is = new FileInputStream("XXXXX/with_table_-_endnote_content_omitted.docx")) {
            XMLResult result = getXML(is, AUTO_DETECT_PARSER, metadata, parseContext);
            String xml = result.xml;
            metadata = result.metadata;
            System.out.println(xml);
            System.out.println();
            System.out.println(metadata);
            System.out.println();
        }
        assertEquals("application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                metadata.get(Metadata.CONTENT_TYPE));
    }
{code}
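
To double-check that the information really is in the file, independent of Tika and POI, one could look inside the .docx package directly. This is only a minimal sketch: it assumes the standard OOXML layout, where endnotes live in the word/endnotes.xml part and tables use the conventional w:tbl element name. The class and method names (EndnoteTableCheck, readEndnotesXml, endnotesContainTable, zipWithEndnotes) are made up for illustration, and the substring test is a crude heuristic rather than real XML parsing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika or POI.
class EndnoteTableCheck {

    // Returns the raw XML of word/endnotes.xml, or null if the package has no such part.
    static String readEndnotesXml(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/endnotes.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current zip entry.
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Crude check: assumes the conventional "w" namespace prefix for tables.
    static boolean endnotesContainTable(InputStream docx) {
        String xml = readEndnotesXml(docx);
        return xml != null && xml.contains("<w:tbl>");
    }

    // Builds a tiny in-memory zip holding only word/endnotes.xml, for trying the check
    // without a real document. The result is not a valid .docx.
    static byte[] zipWithEndnotes(String endnotesXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/endnotes.xml"));
            zip.write(endnotesXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: EndnoteTableCheck <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("endnotes contain a table: " + endnotesContainTable(in));
        }
    }
}
```

Running it against the attached with_table_-_endnote_content_omitted.docx would show whether the table markup is present in the endnotes part, which is a separate question from which extractor emits it.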

> Endnote content in tables omitted from .docx text
> -------------------------------------------------
>
>                 Key: TIKA-4657
>                 URL: https://issues.apache.org/jira/browse/TIKA-4657
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.3
>            Reporter: Klara Mazurak
>            Priority: Major
>         Attachments: with_table_-_endnote_content_omitted.docx, 
> without_table_-_endnotes_work_correctly.docx
>
>
> If an endnote in a .docx file contains text in a table, that text is omitted 
> from Tika's text extraction.
> See the two attached files: the one without a table yields all the text as 
> expected, the one with the table does not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
