[ 
https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061327#comment-13061327
 ] 

Nick Burch commented on TIKA-405:
---------------------------------

I think my comment from May last year still stands: this will likely need some 
bug fixing (and possibly new functionality) in HWPF in POI, and the first step 
is to open a bug in the POI Bugzilla and upload some simple files to test 
against.

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>         Attachments: WordDocWithLinksAndTable.doc, actual.txt, expected.txt
>
>
> I discovered some odd behavior while running a three-way comparison test 
> between Tika, Aperture, and Autonomy KeyView. The input file was a test Word 
> 97 Doc (attached) including a paragraph peppered with hyperlinks and a table 
> filled with dummy text. KeyView generated the full text, as I expected. 
> Aperture and Tika had identical results to one another (barring one lost 
> whitespace character), but their outputs yielded significantly fewer tokens 
> than KeyView's did. I've attached the output text from KeyView and Tika for 
> reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They 
> appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a 
> single blob rather than being emitted separately, which ruins any attempt at 
> tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test 
> file, my guess is that it's a problem with the shared POI library. I thought 
> it would be worth noting, though, in case there's an easy fix on the Tika end 
> of things.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
