[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-405.
------------------------------
Resolution: Fixed
Fix Version/s: 1.5
This appears to be fixed in 1.5. Please reopen with a test case if the
problem persists.
> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
> Key: TIKA-405
> URL: https://issues.apache.org/jira/browse/TIKA-405
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Environment: 32-bit Ubuntu Linux
> Reporter: Curtis Warner
> Fix For: 1.5
>
> Attachments: ASF.LICENSE.NOT.GRANTED--actual.txt,
> ASF.LICENSE.NOT.GRANTED--expected.txt,
> ASF.LICENSE.NOT.GRANTED--WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table
> filled with dummy text. KeyView generated the full text, as I expected.
> Aperture and Tika had identical results to one another (barring one lost
> whitespace character), but their outputs yielded significantly fewer tokens
> than KeyView's did. I've attached the output text from KeyView and Tika for
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are run together into a single blob
> rather than being emitted as separate cells, which defeats any attempt at
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test
> file, my guess is that it's a problem with the shared POI library. I thought
> it would be worth noting, though, in case there's an easy fix on the Tika end
> of things.
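
For anyone reopening with a test case, the token comparison described in the report can be approximated with a small standalone check. This is only a sketch: the class name TokenDiff, its helper methods, and the sample strings are hypothetical stand-ins for the attached expected.txt and actual.txt, and it assumes plain whitespace tokenization rather than whatever tokenizer the reporter used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TokenDiff {
    // Split extracted text on runs of whitespace; blank input yields no tokens.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Tokens present in the reference output but absent from the candidate output,
    // in the order they first appear in the reference.
    static Set<String> missingTokens(String reference, String candidate) {
        Set<String> missing = new LinkedHashSet<>(tokenize(reference));
        missing.removeAll(tokenize(candidate));
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not the contents of the real attachments.
        String expected = "See http://example.org/ for details. cellA cellB cellC";
        String actual = "See for details. cellAcellBcellC";

        System.out.println("expected tokens: " + tokenize(expected).size());
        System.out.println("actual tokens:   " + tokenize(actual).size());
        System.out.println("missing: " + missingTokens(expected, actual));
    }
}
```

With the sample strings above, the skipped hyperlink and the merged table cells both show up as missing tokens, which mirrors the two problems listed in the report. Pointing the same check at real extractor output would show whether either problem persists.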