[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-405.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5

This appears to be fixed with 1.5. Please reopen with test case if problem persists.

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>             Fix For: 1.5
>
>         Attachments: ASF.LICENSE.NOT.GRANTED--actual.txt,
>                      ASF.LICENSE.NOT.GRANTED--expected.txt,
>                      ASF.LICENSE.NOT.GRANTED--WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table
> filled with dummy text. KeyView generated the full text, as I expected.
> Aperture and Tika had identical results to one another (barring one lost
> whitespace character), but their outputs yielded significantly fewer tokens
> than KeyView's did. I've attached the output text from KeyView and Tika for
> reference.
>
> There are two distinct problems I recognized in Tika's text output:
>
> 1) Hyperlinks from the Word Doc aren't included in the output text. They
>    appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a
>    single blob rather than being emitted separately, which ruins any attempt
>    at tokenizing the table's contents.
>
> Seeing as both Tika and Aperture had exactly the same issues with this test
> file, my guess is that it's a problem with the shared POI library. I thought
> it would be worth noting, though, in case there's an easy fix on the Tika end
> of things.
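
To illustrate why the second problem in the report lowers the token count, here is a minimal sketch. It does not call Tika or POI. The cell values are hypothetical (the attached document's contents are not reproduced here), the whitespace tokenizer is a stand-in for whatever a downstream consumer uses, and it assumes the "single blob" means cell text joined with no separator at all.

```python
# Hypothetical table contents; NOT taken from the attached test document.
cells = [["alpha", "beta"], ["gamma", "delta"]]


def tokenize(text):
    """Plain whitespace tokenizer, standing in for a real downstream tokenizer."""
    return text.split()


# Assumed failure mode: every cell's text is run together with no separator.
merged = "".join(cell for row in cells for cell in row)

# Cells emitted separately: a tab between cells, a newline between rows.
separated = "\n".join("\t".join(row) for row in cells)

merged_tokens = tokenize(merged)
separated_tokens = tokenize(separated)

print(len(merged_tokens), len(separated_tokens))
```

With the merged form, the whole table collapses into one token, while the separated form yields one token per cell. Any whitespace delimiter between cells would be enough for this particular tokenizer; the tab and newline are only one possible choice.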