[jira] [Commented] (TIKA-3816) Tika cannot parse the text in the table(Microsoft word)

Tim Allison (Jira) Wed, 14 Sep 2022 08:18:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604803#comment-17604803
 ]


Tim Allison commented on TIKA-3816:
-----------------------------------

One workaround is to use the SAX docx parser, which yields this:

{noformat}
<head>
<meta name="cp:revision" content="28" />
<meta name="extended-properties:AppVersion" content="16.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="12" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="meta:last-author" content="Jason (Yin) Guo" />
<meta name="dc:creator" content="Xiu Hui Loh" />
<meta name="extended-properties:Company" content="" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="dcterms:created" content="2022-07-13T06:28:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="dcterms:modified" content="2022-07-13T07:51:00Z" />
<meta name="meta:character-count" content="70" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="meta:character-count-with-spaces" content="81" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.xwpf.XWPFEventBasedWordExtractor" />
<meta name="extended-properties:DocSecurityString" content="None" />
<meta name="extended-properties:TotalTime" content="10" />
<meta name="meta:page-count" content="1" />
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p>Sample 1:</p>

<table><tr>     <td>Note</td>   <td>Details</td></tr>
<tr><a name="_Hlk105063082" />  <td>Choose an item.</td>        <td>Here is just a sample</td></tr>
<tr>    <td />  <td /></tr>
<tr>    <td />  <td /></tr>
<tr>    <td />  <td /></tr>
</table>
<p />
<p />
<p />
<p />
<p />
<div class="glossary"><p>Choose an item.</p>
<p>Click or tap here to enter text.</p>
</div>
</body></html>
{noformat}

If you are calling Tika programmatically, as you are in the example, you 
need to set this on the ParseContext via an OfficeParserConfig:
{noformat}
        ParseContext context = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        context.set(OfficeParserConfig.class, officeParserConfig);
        debug(getRecursiveMetadata(p, context, true));
{noformat}
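
The snippet above looks like it comes from Tika's own test code (debug and getRecursiveMetadata appear to be test helpers, and p is not defined here). As far as I can tell, the Tika facade's parseToString used in the issue description does not take a ParseContext, so a standalone caller would likely go through AutoDetectParser instead. Below is a minimal sketch under that assumption; the class name SaxDocxExample and the method names saxDocxContext and extractText are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika.
public class SaxDocxExample {

    // Builds a ParseContext that asks the OOXML parser to use the SAX docx extractor.
    static ParseContext saxDocxContext() {
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeParserConfig);
        return context;
    }

    // Extracts plain text from an in-memory document using the context above.
    static String extractText(byte[] bytes) throws Exception {
        Parser parser = new AutoDetectParser();
        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            parser.parse(in, handler, new Metadata(), saxDocxContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a .docx file such as test1.docx.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(extractText(bytes));
    }
}
```

Passing -1 to BodyContentHandler removes the write limit, which stands in for the setMaxStringLength call in the original report. This has not been run against the attached test1.docx.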

> Tika cannot parse the text in the table(Microsoft word)
> -------------------------------------------------------
>
>                 Key: TIKA-3816
>                 URL: https://issues.apache.org/jira/browse/TIKA-3816
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>         Environment: OS : Windows 10,
> Software Platform : Java
>            Reporter: Jason Guo
>            Priority: Major
>             Fix For: 2.4.2
>
>         Attachments: output.PNG, test1.docx
>
>
> I am trying to parse a Microsoft Word document (.doc) that contains a table 
> with a select component and some text.
> The code I am using to parse the document is below:
> public static byte[] convertToByteArray(byte[] bytes) throws Exception {
>     Tika tika = new Tika();
>     if (bytes.length > tika.getMaxStringLength()) {
>         tika.setMaxStringLength(bytes.length);
>     }
>     String result = tika.parseToString(new ByteArrayInputStream(bytes));
>     byte[] rv = result.getBytes();
>     return rv;
> }
> The dependencies I am using are:
> compile ('org.apache.tika:tika-parsers-standard-package:2.3.0') {
>     exclude group: 'org.apache.poi', module: 'poi-scratchpad'
>     exclude group: 'org.apache.poi', module: 'poi'
>     // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
> }
> compile 'org.apache.tika:tika-core:2.3.0'
> compile 'org.apache.poi:poi-scratchpad:5.2.1'
> compile 'org.apache.poi:poi:5.2.1'



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
