[
https://issues.apache.org/jira/browse/TIKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604803#comment-17604803
]
Tim Allison commented on TIKA-3816:
-----------------------------------
One workaround is to use the SAX docx parser, which yields this:
{noformat}
<head>
<meta name="cp:revision" content="28" />
<meta name="extended-properties:AppVersion" content="16.0000" />
<meta name="meta:paragraph-count" content="1" />
<meta name="meta:word-count" content="12" />
<meta name="extended-properties:Application" content="Microsoft Office Word" />
<meta name="meta:last-author" content="Jason (Yin) Guo" />
<meta name="dc:creator" content="Xiu Hui Loh" />
<meta name="extended-properties:Company" content="" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="dcterms:created" content="2022-07-13T06:28:00Z" />
<meta name="meta:line-count" content="1" />
<meta name="dcterms:modified" content="2022-07-13T07:51:00Z" />
<meta name="meta:character-count" content="70" />
<meta name="extended-properties:Template" content="Normal.dotm" />
<meta name="meta:character-count-with-spaces" content="81" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
<meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.microsoft.ooxml.xwpf.XWPFEventBasedWordExtractor"
/>
<meta name="extended-properties:DocSecurityString" content="None" />
<meta name="extended-properties:TotalTime" content="10" />
<meta name="meta:page-count" content="1" />
<meta name="Content-Type"
content="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
/>
<meta name="dc:publisher" content="" />
<title></title>
</head>
<body><p>Sample 1:</p>
<table><tr> <td>Note</td> <td>Details</td></tr>
<tr><a name="_Hlk105063082" /> <td>Choose an item.</td> <td>Here is
just a sample</td></tr>
<tr> <td /> <td /></tr>
<tr> <td /> <td /></tr>
<tr> <td /> <td /></tr>
</table>
<p />
<p />
<p />
<p />
<p />
<div class="glossary"><p>Choose an item.</p>
<p>Click or tap here to enter text.</p>
</div>
</body></html>
{noformat}
If you are calling Tika programmatically, as you are in the example, you
need to set this via the ParseContext:
{noformat}
ParseContext context = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
context.set(OfficeParserConfig.class, officeParserConfig);
debug(getRecursiveMetadata(p, context, true));
{noformat}
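The snippet above relies on helper calls that are not shown here. As a rough, untested sketch, the convertToByteArray method from the issue description might look something like this with the same config applied. The class name SaxDocxExample is made up for illustration. The sketch swaps the Tika facade for AutoDetectParser with a BodyContentHandler so that a ParseContext can be passed in, and it assumes a write limit of -1 is an acceptable stand-in for the setMaxStringLength call in the original.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name, used only for this illustration.
public class SaxDocxExample {

    public static byte[] convertToByteArray(byte[] bytes) throws Exception {
        Parser parser = new AutoDetectParser();

        // Ask the OOXML parser to use the SAX-based docx extractor.
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);

        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeParserConfig);

        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);

        try (InputStream is = new ByteArrayInputStream(bytes)) {
            parser.parse(is, handler, new Metadata(), context);
        }

        // Assumption: UTF-8 is wanted; the original used the platform default.
        return handler.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

This needs tika-parsers-standard-package on the classpath, as in the dependency list in the issue. The returned bytes hold the extracted plain text, not the XHTML shown earlier.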
> Tika cannot parse the text in the table(Microsoft word)
> -------------------------------------------------------
>
> Key: TIKA-3816
> URL: https://issues.apache.org/jira/browse/TIKA-3816
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Environment: OS : Windows 10,
> Software Platform : Java
> Reporter: Jason Guo
> Priority: Major
> Fix For: 2.4.2
>
> Attachments: output.PNG, test1.docx
>
>
> I am trying to parse a Microsoft Word document (.docx) that contains a table
> with a select component and some text.
> The code I am using to parse the document is below:
> public static byte[] convertToByteArray(byte[] bytes) throws Exception {
>     Tika tika = new Tika();
>     if (bytes.length > tika.getMaxStringLength()) {
>         tika.setMaxStringLength(bytes.length);
>     }
>     String result = tika.parseToString(new ByteArrayInputStream(bytes));
>     byte[] rv = result.getBytes();
>     return rv;
> }
> The dependencies I am using are:
> compile ('org.apache.tika:tika-parsers-standard-package:2.3.0') {
>     exclude group: 'org.apache.poi', module: 'poi-scratchpad'
>     exclude group: 'org.apache.poi', module: 'poi'
>     // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
> }
> compile 'org.apache.tika:tika-core:2.3.0'
> compile 'org.apache.poi:poi-scratchpad:5.2.1'
> compile 'org.apache.poi:poi:5.2.1'