[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061316#comment-13061316 ]
isha marwah commented on TIKA-405:
----------------------------------
Could we get a status update on this issue, please?
> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
> Key: TIKA-405
> URL: https://issues.apache.org/jira/browse/TIKA-405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: 32-bit Ubuntu Linux
> Reporter: Curtis Warner
> Attachments: WordDocWithLinksAndTable.doc, actual.txt, expected.txt
>
>
> I discovered some odd behavior while running a three-way comparison test
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table
> filled with dummy text. KeyView generated the full text, as I expected.
> Aperture and Tika had identical results to one another (barring one lost
> whitespace character), but their outputs yielded significantly fewer tokens
> than KeyView's did. I've attached the output text from KeyView and Tika for
> reference.
> I found two distinct problems in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are run together into a single blob
> rather than being emitted separately, which ruins any attempt at tokenizing
> the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test
> file, my guess is that it's a problem with the shared POI library. I thought
> it would be worth noting, though, in case there's an easy fix on the Tika end
> of things.
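
To make the second problem concrete, here is a minimal, self-contained sketch of why fused table cells defeat tokenization. It does not call Tika or POI. The class name TableTokenDemo, the tokenize helper, and the cell values are all hypothetical and are not taken from the attached document. It assumes a simple whitespace tokenizer, which may differ from the one used in the original comparison.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TableTokenDemo {

    // Assumed tokenizer: split on runs of whitespace and drop empty tokens.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical cell values standing in for the table's dummy text.
        List<String> cells = Arrays.asList("alpha", "beta", "gamma", "delta");

        // Cells emitted with no separator, as described in problem 2.
        String blob = String.join("", cells);

        // Cells emitted with a tab between them, one possible expected form.
        String separated = String.join("\t", cells);

        System.out.println("fused cells:     " + tokenize(blob).size() + " token(s)");
        System.out.println("separated cells: " + tokenize(separated).size() + " token(s)");
    }
}
```

With these made-up values the fused form yields a single token while the tab-separated form yields one token per cell, which matches the kind of token-count gap described in the report.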