[
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440
]
Tim Allison commented on TIKA-1584:
-----------------------------------
Able to attach example triggering doc? By "same behavior" do you mean that a
docx inside a zip is not extracted with -X?
> Tika 1.7 possible regression (nested attachment files not getting parsed)
> -------------------------------------------------------------------------
>
> Key: TIKA-1584
> URL: https://issues.apache.org/jira/browse/TIKA-1584
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.7
> Reporter: Rob Tulloh
>
> I tried to send this to the tika user list, but got a qmail failure so I am
> opening a jira to see if I can get help with this.
> There appears to be a change in the behavior of tika since 1.5 (the last
> version we have used). In 1.5, if we pass a file with content type of rfc822
> which contains a zip that contains a docx file, the entire content would get
> recursed and the text returned. In 1.7, tika only unwinds as far as the zip
> file and ignores the content of the contained docx file. This is causing a
> regression failure in our search tests because the contents of the docx file
> are not found when searched for.
>
> We are testing with tika-server if this helps. If we ask the meta service to
> just characterize the test data, it correctly determines the input is of type
> rfc822. However, on extract, the contents of the attachment are not extracted
> as expected.
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream
> http://localhost:9998/meta 2>/dev/null | grep Content-Type
> "Content-Type","message/rfc822"
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream
> http://localhost:9998/tika 2>/dev/null | grep docx
> sign.docx <<<<--- this is not expected, need contents of this extracted
> We can easily reproduce this problem with a simple eml file with an
> attachment. Can someone please comment if this seems like a problem or
> perhaps we need to change something in our call to get the old behavior?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)