[ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-405.
------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5

This appears to be fixed in 1.5.  Please reopen with a test case if the 
problem persists.
                
> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>             Fix For: 1.5
>
>         Attachments: ASF.LICENSE.NOT.GRANTED--actual.txt, 
> ASF.LICENSE.NOT.GRANTED--expected.txt, 
> ASF.LICENSE.NOT.GRANTED--WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test 
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word 
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table 
> filled with dummy text. KeyView generated the full text, as I expected. 
> Aperture and Tika had identical results to one another (barring one lost 
> whitespace character), but their outputs yielded significantly fewer tokens 
> than KeyView's did. I've attached the output text from KeyView and Tika for 
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They 
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are run together into a single blob 
> rather than being emitted as separate cells, which ruins any attempt at 
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test 
> file, my guess is that it's a problem with the shared POI library. I thought 
> it would be worth noting, though, in case there's an easy fix on the Tika end 
> of things.
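
The token comparison described in the report can be approximated with a small
standalone check. This is only a sketch, not part of Tika or POI: the class
name TokenDiff, its methods, and the sample strings are hypothetical stand-ins
for the attached expected and actual outputs. It splits both texts on
whitespace and lists the tokens from the expected text that have no
counterpart in the actual text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for comparing two extracted-text outputs token by token.
public class TokenDiff {

    // Split on runs of whitespace; blank input yields an empty list.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Tokens of `expected` with no matching occurrence in `actual`
    // (a multiset difference, so repeated tokens are counted).
    static List<String> missingTokens(String expected, String actual) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : tokenize(actual)) {
            remaining.merge(token, 1, Integer::sum);
        }
        List<String> missing = new ArrayList<>();
        for (String token : tokenize(expected)) {
            Integer count = remaining.get(token);
            if (count == null || count == 0) {
                missing.add(token);
            } else {
                remaining.put(token, count - 1);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample text imitating the two symptoms: a dropped
        // hyperlink and table cells merged into one token.
        String expected = "See http://example.org for details A1 B1 A2 B2";
        String actual = "See for details A1B1A2B2";
        // prints [http://example.org, A1, B1, A2, B2]
        System.out.println(missingTokens(expected, actual));
    }
}
```

An empty result for a given pair of outputs would mean every expected token
was found, which is one way to confirm the fix against a test document before
reopening.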

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
