[jira] [Commented] (TIKA-2441) Unable to extract text present in a table inside a textbox in MS Word

Tim Allison (JIRA) Thu, 10 Aug 2017 05:34:13 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121546#comment-16121546
 ]


Tim Allison commented on TIKA-2441:
-----------------------------------

Thank you for opening this issue.  I'll look into it at the POI level.  As a 
workaround, you can switch to our experimental SAX-based parser for DOCX, and 
that does work on this document.

In your tika-config file, do something like this:
{noformat}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
            <params>
                <param name="useSAXDocxExtractor" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
{noformat}

Or, if you are calling Tika from your own Java code (for example, alongside SolrJ), set the option on the ParseContext:
{noformat}
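        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.microsoft.OfficeParserConfig;

        // Enable the experimental SAX-based DOCX extractor for this parse only
        ParseContext pc = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        pc.set(OfficeParserConfig.class, officeParserConfig);
        // ...then pass pc as the ParseContext argument in your call to parse(...)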
{noformat}
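
For context, here is a minimal sketch of what passing the ParseContext to parse might look like end to end. The class name SaxDocxExample and the extractText helper are hypothetical names chosen for illustration, and the sketch assumes tika-core and tika-parsers 1.15 or later are on the classpath. AutoDetectParser, BodyContentHandler, and Metadata are the usual Tika classes; your indexing code may well wire things differently.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika or Solr.
public class SaxDocxExample {

    // Extracts plain text from a stream with the experimental
    // SAX-based DOCX extractor switched on.
    public static String extractText(InputStream in) throws Exception {
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);

        ParseContext pc = new ParseContext();
        pc.set(OfficeParserConfig.class, officeParserConfig);

        Parser parser = new AutoDetectParser();
        // Register the parser in the context so embedded documents are parsed too.
        pc.set(Parser.class, parser);

        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, new Metadata(), pc);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the document to extract, e.g. the .docx from the attachment
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

Running this against the attached document should show whether the text inside the table is picked up by the SAX-based extractor in your environment.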

> Unable to extract text present in a table inside a textbox in MS Word
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2441
>                 URL: https://issues.apache.org/jira/browse/TIKA-2441
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>         Environment: Windows, Linux, Apache Tika 1.15 used with Apache 
> Solr-6.6.0
>            Reporter: Amit Humnabadkar
>              Labels: sax_docx_fixes
>         Attachments: doc001.zip
>
>
> Hello,
> I am using Tika-1.15 with Solr-6.6.0 for indexing and searching. This setup 
> fails to index text present in a table inside a textbox in a Word document.
> The attached MS Word document contains two words:
> 1. Germany - present in a table inside a textbox
> 2. Africa - present in a textbox
> Germany is not getting indexed while Africa gets indexed successfully. Looks 
> like Tika fails to extract the content present in table inside a textbox.
> Please have a look.
> Thanks,
> Amit Humnabadkar
> [^doc001.zip]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
