[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328
 ] 

Rob Tulloh commented on TIKA-1584:
----------------------------------

If the .zip file is passed to tika, it shows the same behavior.

{noformat}
curl -X PUT -T sign.zip -H Content-Type:application/octet-stream  
http://localhost:9998/tika 2>/dev/null

sign.docx

{noformat}

> Tika 1.7 possible regression (nested attachment files not getting parsed)
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1584
>                 URL: https://issues.apache.org/jira/browse/TIKA-1584
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.7
>            Reporter: Rob Tulloh
>
> I tried to send this to the tika user list, but got a qmail failure so I am 
> opening a jira to see if I can get help with this.
> There appears to be a change in the behavior of tika since 1.5 (the last 
> version we have used). In 1.5, if we pass a file with content type of rfc822 
> which contains a zip that contains a docx file, the entire content would get 
> recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
> file and ignores the content of the contained docx file. This is causing a 
> regression failure in our search tests because the contents of the docx file 
> are not found when searched for.
>  
> We are testing with tika-server if this helps. If we ask the meta service to 
> just characterize the test data, it correctly determines the input is of type 
> rfc822. However, on extract, the contents of the attachment are not extracted 
> as expected.
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
> http://localhost:9998/meta 2>/dev/null | grep Content-Type
> "Content-Type","message/rfc822"
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
> http://localhost:9998/tika 2>/dev/null | grep docx
> sign.docx       <<<<--- this is not expected, need contents of this extracted
> We can easily reproduce this problem with a simple eml file with an 
> attachment. Can someone please comment if this seems like a problem or 
> perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to