[
https://issues.apache.org/jira/browse/TIKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604801#comment-17604801
]
Tim Allison commented on TIKA-3816:
-----------------------------------
Just picking this up now. Thank you for submitting a triggering file.
The issue is that a table may contain not just (regular) rows, but also
CTSdtRow. We don't currently have access to these rows via the POI API, and we
don't even ship this class with the standard ooxml jar. Currently, I can
extract this text only with poi-ooxml-full on my classpath.
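
For illustration only, here is a minimal JDK-only sketch of the shape of the problem. It is not the Tika or POI code path, and the class and method names (SdtRowSketch, rowTexts) are hypothetical. It assumes that a CTSdtRow corresponds to a w:sdt element sitting directly under w:tbl, whose w:sdtContent wraps one or more w:tr rows. A walker that only looks at direct w:tr children misses the wrapped rows.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SdtRowSketch {
    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the concatenated text of each row in a w:tbl fragment, in document order. */
    static List<String> rowTexts(String tblXml, boolean includeSdtRows) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(tblXml)));
        List<String> rows = new ArrayList<>();
        collectRows(doc.getDocumentElement(), includeSdtRows, rows);
        return rows;
    }

    private static void collectRows(Element parent, boolean includeSdtRows, List<String> out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !W.equals(n.getNamespaceURI())) {
                continue;
            }
            Element e = (Element) n;
            if ("tr".equals(e.getLocalName())) {
                // A regular row: take all text beneath it.
                out.add(e.getTextContent().trim());
            } else if (includeSdtRows && "sdt".equals(e.getLocalName())) {
                // A row-level structured document tag: descend into w:sdtContent.
                for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c instanceof Element && W.equals(c.getNamespaceURI())
                            && "sdtContent".equals(c.getLocalName())) {
                        collectRows((Element) c, true, out);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hand-written fragment, not taken from the attached test1.docx.
        String xml = "<w:tbl xmlns:w=\"" + W + "\">"
                + "<w:tr><w:tc><w:p><w:r><w:t>plain row</w:t></w:r></w:p></w:tc></w:tr>"
                + "<w:sdt><w:sdtContent>"
                + "<w:tr><w:tc><w:p><w:r><w:t>sdt row</w:t></w:r></w:p></w:tc></w:tr>"
                + "</w:sdtContent></w:sdt>"
                + "</w:tbl>";
        System.out.println(rowTexts(xml, false)); // prints [plain row]
        System.out.println(rowTexts(xml, true));  // prints [plain row, sdt row]
    }
}
```

With includeSdtRows set to false the wrapped row is silently dropped, which matches the symptom in the report. A real fix would go through the POI/XmlBeans object model rather than raw DOM.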
> Tika cannot parse the text in the table(Microsoft word)
> -------------------------------------------------------
>
> Key: TIKA-3816
> URL: https://issues.apache.org/jira/browse/TIKA-3816
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Environment: OS : Windows 10,
> Software Platform : Java
> Reporter: Jason Guo
> Priority: Major
> Fix For: 2.4.2
>
> Attachments: output.PNG, test1.docx
>
>
> I am trying to parse a Microsoft Word document (.doc) which contains a table
> with a select component and some text.
> The code I am using to parse the document is below:
> public static byte[] convertToByteArray(byte[] bytes) throws Exception {
>     Tika tika = new Tika();
>     if (bytes.length > tika.getMaxStringLength()) {
>         tika.setMaxStringLength(bytes.length);
>     }
>     String result = tika.parseToString(new ByteArrayInputStream(bytes));
>     byte[] rv = result.getBytes();
>     return rv;
> }
> The dependencies I am using are:
> compile ('org.apache.tika:tika-parsers-standard-package:2.3.0') {
>     exclude group: 'org.apache.poi', module: 'poi-scratchpad'
>     exclude group: 'org.apache.poi', module: 'poi'
>     // exclude group: 'com.drewnoakes', module: 'metadata-extractor'
> }
> compile 'org.apache.tika:tika-core:2.3.0'
> compile 'org.apache.poi:poi-scratchpad:5.2.1'
> compile 'org.apache.poi:poi:5.2.1'