RE: [jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
Backwards compatibility issue found by clirr on TIKA-1587:

[INFO] --- clirr-maven-plugin:2.3:check (default) @ tika-core ---
[ERROR] org.apache.tika.fork.ForkParser: Return type of method 'public java.lang.String getJavaCommand()' has been changed to java.util.List
[ERROR] org.apache.tika.fork.ForkParser: Parameter 1 of 'public void setJavaCommand(java.lang.String)' has changed its type to java.util.List

-----Original Message-----
From: Hudson (JIRA) [mailto:j...@apache.org]
Sent: Monday, March 30, 2015 10:35 AM
To: talli...@apache.org
Subject: [jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386765#comment-14386765 ]

Hudson commented on TIKA-1584:
------------------------------

FAILURE: Integrated in tika-trunk-jdk1.7 #585 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/585/])
TIKA-1584: fixed regression in Tika 1.7 that prevents processing of embedded docs with /tika service (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670095)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java

Tika 1.7 possible regression (nested attachment files not getting parsed)
-------------------------------------------------------------------------

Key: TIKA-1584
URL: https://issues.apache.org/jira/browse/TIKA-1584
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker
Fix For: 1.8

I tried to send this to the tika user list, but got a qmail failure so I am opening a jira to see if I can get help with this.

There appears to be a change in the behavior of tika since 1.5 (the last version we have used). In 1.5, if we pass a file with content type of rfc822 which contains a zip that contains a docx file, the entire content would get recursed and the text returned. In 1.7, tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for.

We are testing with tika-server if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected.

    curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
    Content-Type,message/rfc822

    curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
    sign.docx <--- this is not expected, need contents of this extracted

We can easily reproduce this problem with a simple eml file with an attachment. Can someone please comment if this seems like a problem or perhaps we need to change something in our call to get the old behavior?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
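The nested test input described in the report (an rfc822 message carrying a zip that wraps a document) can be built with the Python standard library alone. The following is a minimal sketch, not the reporter's actual fixture: the function names, the addresses, and the default `sign.zip` file name are assumptions made for illustration, and any document bytes can stand in for the real `sign.docx`.

```python
import io
import zipfile
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_zip(inner_name: str, inner_bytes: bytes) -> bytes:
    """Return a zip archive holding a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(inner_name, inner_bytes)
    return buf.getvalue()


def build_nested_eml(inner_name: str, inner_bytes: bytes,
                     zip_name: str = "sign.zip") -> bytes:
    """Return an rfc822 message whose only attachment is a zip wrapping the document."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"        # placeholder address (assumption)
    msg["To"] = "recipient@example.com"       # placeholder address (assumption)
    msg["Subject"] = "nested attachment test"
    msg.set_content("See attached.")
    msg.add_attachment(build_zip(inner_name, inner_bytes),
                       maintype="application", subtype="zip",
                       filename=zip_name)
    return msg.as_bytes()


def list_nested_members(eml_bytes: bytes) -> dict:
    """Map each zip attachment name to the member names inside it (sanity check)."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    members = {}
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/zip":
            with zipfile.ZipFile(io.BytesIO(part.get_content())) as zf:
                members[part.get_filename()] = zf.namelist()
    return members


def write_fixture(src_path: str, out_path: str = "test.eml") -> None:
    """Wrap an existing document on disk and write the resulting .eml file."""
    with open(src_path, "rb") as src:
        payload = src.read()
    with open(out_path, "wb") as out:
        out.write(build_nested_eml(src_path, payload))
```

Calling `write_fixture("sign.docx")` would produce a `test.eml` that can be sent with the curl commands quoted above; `list_nested_members` only confirms the nesting and does not involve Tika.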
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386899#comment-14386899 ]

Rob Tulloh commented on TIKA-1584:
----------------------------------

Thanks for the quick turnaround on fixing this. Is this expected to be released as a fix to 1.7?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385477#comment-14385477 ]

Tim Allison commented on TIKA-1584:
-----------------------------------

IMHO this is major enough for a fix asap. Whether that's 1.7.1 with just this fix or a full cut of trunk as 1.8 is up to all devs. Tika colleagues, what do you think?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385494#comment-14385494 ]

Rob Tulloh commented on TIKA-1584:
----------------------------------

I would vote for a release, as we have been waiting for tika-1371 and were hoping to upgrade to 1.7.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463 ]

Tim Allison commented on TIKA-1584:
-----------------------------------

Just checked svn. That's a major regression added in 1.7 when we added specification of ParseContext. We need to add the Parser to the ParseContext to get recursive parsing. W/o use of ParseContext in call to parse, the parser works recursively. Will fix Monday unless someone beats me to it.

Thank you for raising this. No need to attach test doc.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
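The mechanism Tim describes can be illustrated with a small toy model. This is a sketch only and not Tika's code: every class here is a hypothetical stand-in, and it assumes that a container parser looks up the parser for embedded documents in the context and falls back to a parser that extracts nothing when none has been registered. Under that assumption, an empty context yields the embedded file's name but not its text, which matches the `sign.docx` output in the report.

```python
class Parser:
    """Toy base type; also used as the lookup key in the context."""

    def parse(self, doc, context):
        raise NotImplementedError


class EmptyParser(Parser):
    """Fallback that extracts nothing from an embedded document."""

    def parse(self, doc, context):
        return []


class ParseContext:
    """Toy type-keyed registry (a stand-in, not Tika's ParseContext)."""

    def __init__(self):
        self._entries = {}

    def set(self, key, value):
        self._entries[key] = value

    def get(self, key, default):
        return self._entries.get(key, default)


class ContainerAwareParser(Parser):
    """Emits a document's own text, then each embedded document's name and content."""

    def parse(self, doc, context):
        extracted = [doc["text"]] if doc.get("text") else []
        # The parser used for embedded documents comes from the context.
        embedded_parser = context.get(Parser, EmptyParser())
        for child in doc.get("children", []):
            extracted.append(child["name"])
            extracted.extend(embedded_parser.parse(child, context))
        return extracted


# Hypothetical in-memory model of a zip that wraps one document.
SIGN_ZIP = {
    "name": "sign.zip",
    "children": [{"name": "sign.docx", "text": "needle-text"}],
}


def extract(doc, register_parser):
    """Parse doc, optionally registering the parser in the context first."""
    parser = ContainerAwareParser()
    context = ParseContext()
    if register_parser:
        context.set(Parser, parser)
    return parser.parse(doc, context)
```

With `register_parser=False` the model returns only the embedded file name; with `register_parser=True` it also returns the embedded text, which is the recursive behavior the fix is meant to restore.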
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483 ]

Tyler Palsulich commented on TIKA-1584:
---------------------------------------

We now have two major issues which need a quick release. So, I would say go for 1.8. Tim, can you chime in on the current discuss thread?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385472#comment-14385472 ]

Rob Tulloh commented on TIKA-1584:
----------------------------------

Thank you. For what it's worth, it is easy to reproduce. Just zip any document you want and then pass the zip file to tika server and see what it gives back.

As 1.7 is released, does this mean that this won't be fixed until 1.8 or would 1.7 get re-released/patched?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328 ]

Rob Tulloh commented on TIKA-1584:
----------------------------------

If the .zip file is passed to tika, it shows the same behavior.

{noformat}
curl -X PUT -T sign.zip -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null
sign.docx
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
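For an automated regression check, the same PUT request can be issued from Python instead of curl. This is a rough sketch under stated assumptions: the function names are hypothetical, the URL is the local tika-server address used in the report, and the response is assumed to be text that can be searched for a known string from the embedded document.

```python
import urllib.request


def build_put_request(url: str, payload: bytes) -> urllib.request.Request:
    """Build the PUT request that `curl -X PUT -T <file>` sends in the report."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def extract_text(url: str, payload: bytes, timeout: float = 30.0) -> str:
    """Send the payload to a running server and return the response body as text."""
    request = build_put_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def nested_text_extracted(response_text: str, needle: str) -> bool:
    """True when text known to be inside the embedded document was returned."""
    return needle in response_text
```

A check along the lines of `nested_text_extracted(extract_text("http://localhost:9998/tika", zip_bytes), "some phrase from the docx")` would return False when only the file name `sign.docx` comes back and True once the embedded content is extracted.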
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440 ]

Tim Allison commented on TIKA-1584:
-----------------------------------

Able to attach example triggering doc? By same behavior do you mean that a docx inside a zip is not extracted with -X?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)