[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463
 ] 

Tim Allison commented on TIKA-1584:
-----------------------------------

Just checked svn. That's a major regression introduced in 1.7, when we added 
specification of a ParseContext. We need to add the Parser to the ParseContext 
to get recursive parsing. Without a ParseContext in the call to parse, the 
parser works recursively. Will fix Monday unless someone beats me to it. Thank 
you for raising this. No need to attach a test doc.
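
To illustrate why the Parser has to be registered, here is a minimal, 
self-contained model of the pattern. It is a sketch under assumptions, not 
Tika's code: Doc, Parser, Context, EMPTY and CONTAINER are all hypothetical 
stand-ins, and the "extract nothing" fallback is assumed for illustration. In 
Tika itself the corresponding registration would be 
context.set(Parser.class, parser) on an org.apache.tika.parser.ParseContext.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model only: every type here is a hypothetical stand-in, not a Tika class.
final class RecursiveParseSketch {

    /** A document: its own text plus any embedded attachments. */
    static final class Doc {
        final String text;
        final List<Doc> attachments = new ArrayList<>();

        Doc(String text, Doc... attachments) {
            this.text = text;
            for (Doc d : attachments) {
                this.attachments.add(d);
            }
        }
    }

    interface Parser {
        String parse(Doc doc, Context context);
    }

    /** A type-keyed lookup table handed down through every parse call. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Assumed fallback when no parser is registered: extract nothing. */
    static final Parser EMPTY = (doc, context) -> "";

    /** Emits the document's text, then hands each attachment to the parser found in the context. */
    static final Parser CONTAINER = (doc, context) -> {
        StringBuilder out = new StringBuilder(doc.text);
        Parser delegate = context.get(Parser.class, EMPTY);
        for (Doc attachment : doc.attachments) {
            out.append(' ').append(delegate.parse(attachment, context));
        }
        return out.toString().trim();
    };

    public static void main(String[] args) {
        // eml -> zip entry "sign.docx" -> docx body, loosely mirroring the report.
        Doc eml = new Doc("mail body", new Doc("sign.docx", new Doc("docx text")));

        // No parser registered: attachments fall through to EMPTY.
        Context bare = new Context();
        System.out.println(CONTAINER.parse(eml, bare));

        // Parser registered in the context: attachments are parsed recursively.
        Context recursive = new Context();
        recursive.set(Parser.class, CONTAINER);
        System.out.println(CONTAINER.parse(eml, recursive));
    }
}
```

In this model the first call prints only "mail body", while the second prints 
"mail body sign.docx docx text". The model does not try to reproduce the exact 
depth at which tika-server 1.7 stopped unwinding; it only shows that embedded 
content is dropped when the context holds no parser to delegate to.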

> Tika 1.7 possible regression (nested attachment files not getting parsed)
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1584
>                 URL: https://issues.apache.org/jira/browse/TIKA-1584
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.7
>            Reporter: Rob Tulloh
>
> I tried to send this to the tika user list, but got a qmail failure so I am 
> opening a jira to see if I can get help with this.
> There appears to be a change in the behavior of tika since 1.5 (the last 
> version we have used). In 1.5, if we pass a file with content type of rfc822 
> which contains a zip that contains a docx file, the entire content would get 
> recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
> file and ignores the content of the contained docx file. This is causing a 
> regression failure in our search tests because the contents of the docx file 
> are not found when searched for.
>  
> We are testing with tika-server if this helps. If we ask the meta service to 
> just characterize the test data, it correctly determines the input is of type 
> rfc822. However, on extract, the contents of the attachment are not extracted 
> as expected.
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
> "Content-Type","message/rfc822"
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
> sign.docx       <<<<--- this is not expected, need contents of this extracted
> We can easily reproduce this problem with a simple eml file with an 
> attachment. Can someone please comment if this seems like a problem or 
> perhaps we need to change something in our call to get the old behavior?



