Tim Allison commented on TIKA-2441:

Thank you for opening this issue.  I'll look into it at the POI level.  As a 
workaround, you can switch to our experimental SAX-based parser for DOCX, and 
that does work on this document.

In your tika-config file, do something like this:
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
                <params>
                        <param name="useSAXDocxExtractor" type="bool">true</param>
                </params>
        </parser>
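
For context, here is a rough sketch of how that fragment might sit in a
complete tika-config.xml. The <properties>/<parsers> wrapper is the usual
layout of a Tika config file. The <parser-exclude> entry is my assumption:
it keeps the DefaultParser from also claiming DOCX files, so that only the
explicitly configured OOXMLParser handles them. Adjust to your setup:

        <?xml version="1.0" encoding="UTF-8"?>
        <properties>
          <parsers>
            <!-- Default parsers, minus OOXML (assumption: explicit exclude) -->
            <parser class="org.apache.tika.parser.DefaultParser">
              <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
            </parser>
            <!-- OOXML parser with the experimental SAX DOCX extractor on -->
            <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
              <params>
                <param name="useSAXDocxExtractor" type="bool">true</param>
              </params>
            </parser>
          </parsers>
        </properties>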

Or if you want to use SolrJ:
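        ParseContext pc = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        // Turn on the experimental SAX-based DOCX extractor
        officeParserConfig.setUseSAXDocxExtractor(true);
        pc.set(OfficeParserConfig.class, officeParserConfig);
        // ...then pass pc to your call to parse...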

> Unable to extract text present in a table inside a textbox in MS Word
> ---------------------------------------------------------------------
>                 Key: TIKA-2441
>                 URL: https://issues.apache.org/jira/browse/TIKA-2441
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>         Environment: Windows, Linux, Apache tika 1.15 used with Apache 
> Solr-6.6.0
>            Reporter: Amit Humnabadkar
>              Labels: sax_docx_fixes
>         Attachments: doc001.zip
> Hello,
> I am using Tika-1.15 with Solr-6.6.0 for indexing and searching. This setup 
> fails to index text present in a table inside a textbox in a Word document.
> An MS Word document contains two words - 
> 1. Germany - present in a table inside a textbox
> 2. Africa - present in a textbox
> Germany is not getting indexed while Africa gets indexed successfully. Looks 
> like Tika fails to extract the content present in table inside a textbox.
> Please have a look.
> Thanks,
> Amit Humnabadkar
> [^doc001.zip]
