RE: [jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-30 Thread Allison, Timothy B.
Backwards compatibility issue found by clirr on TIKA-1587

[INFO] --- clirr-maven-plugin:2.3:check (default) @ tika-core ---

[ERROR] org.apache.tika.fork.ForkParser: Return type of method 'public 
java.lang.String getJavaCommand()' has been changed to java.util.List
[ERROR] org.apache.tika.fork.ForkParser: Parameter 1 of 'public void 
setJavaCommand(java.lang.String)' has changed its type to java.util.List

-----Original Message-----
From: Hudson (JIRA) [mailto:j...@apache.org] 
Sent: Monday, March 30, 2015 10:35 AM
To: talli...@apache.org
Subject: [jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested 
attachment files not getting parsed)


[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386765#comment-14386765
 ] 

Hudson commented on TIKA-1584:
------------------------------

FAILURE: Integrated in tika-trunk-jdk1.7 #585 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/585/])
TIKA-1584: fixed regression in Tika 1.7 that prevents processing of embedded 
docs with /tika service (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670095)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java


> Tika 1.7 possible regression (nested attachment files not getting parsed)
> -------------------------------------------------------------------------
>
>              Key: TIKA-1584
>              URL: https://issues.apache.org/jira/browse/TIKA-1584
>          Project: Tika
>       Issue Type: Bug
>       Components: server
> Affects Versions: 1.7
>         Reporter: Rob Tulloh
>         Assignee: Tim Allison
>         Priority: Blocker
>          Fix For: 1.8
>
>
> I tried to send this to the tika user list, but got a qmail failure so I am 
> opening a jira to see if I can get help with this.
> There appears to be a change in the behavior of tika since 1.5 (the last 
> version we have used). In 1.5, if we pass a file with content type of rfc822 
> which contains a zip that contains a docx file, the entire content would get 
> recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
> file and ignores the content of the contained docx file. This is causing a 
> regression failure in our search tests because the contents of the docx file 
> are not found when searched for.
>
> We are testing with tika-server if this helps. If we ask the meta service to 
> just characterize the test data, it correctly determines the input is of type 
> rfc822. However, on extract, the contents of the attachment are not extracted 
> as expected.
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
> Content-Type,message/rfc822
> curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
> sign.docx   <--- this is not expected, need contents of this extracted
> We can easily reproduce this problem with a simple eml file with an 
> attachment. Can someone please comment if this seems like a problem or 
> perhaps we need to change something in our call to get the old behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386765#comment-14386765
 ] 

Hudson commented on TIKA-1584:
------------------------------

FAILURE: Integrated in tika-trunk-jdk1.7 #585 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/585/])
TIKA-1584: fixed regression in Tika 1.7 that prevents processing of embedded 
docs with /tika service (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670095)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java




[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-30 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386899#comment-14386899
 ] 

Rob Tulloh commented on TIKA-1584:
----------------------------------

Thanks for the quick turnaround on fixing this. Is it expected to be released 
as a fix to 1.7?



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385477#comment-14385477
 ] 

Tim Allison commented on TIKA-1584:
-----------------------------------

IMHO this is major enough for a fix asap. Whether that's 1.7.1 with just this 
fix or a full cut of trunk as 1.8 is up to all devs. Tika colleagues, what do 
you think?



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385494#comment-14385494
 ] 

Rob Tulloh commented on TIKA-1584:
----------------------------------

I would vote for a release, as we have been waiting for TIKA-1371 and were 
hoping to upgrade to 1.7.



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385463#comment-14385463
 ] 

Tim Allison commented on TIKA-1584:
-----------------------------------

Just checked svn. That's a major regression added in 1.7 when we added 
specification of ParseContext. We need to add the Parser to the ParseContext to 
get recursive parsing. W/o use of ParseContext in call to parse, the parser 
works recursively. Will fix Monday unless someone beats me to it. Thank you for 
raising this. No need to attach test doc.
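
The lookup described in the comment above can be modeled with a small, self-contained sketch. This is an illustration only, not code from the Tika source tree: the classes Doc, Context and Parser below are hypothetical stand-ins, and the file contents are made up. It assumes that a container parser hands each embedded file to whatever parser is registered in the context, and falls back to a parser that emits nothing when none is registered. In Tika itself the corresponding registration is a call along the lines of context.set(Parser.class, parser) before calling parse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RecursiveParseSketch {

    /** Hypothetical document: a name, its own text, and any embedded files. */
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> children = new ArrayList<>();

        Doc(String name, String text, Doc... embedded) {
            this.name = name;
            this.text = text;
            this.children.addAll(Arrays.asList(embedded));
        }
    }

    /** Hypothetical parser interface, reduced to what the sketch needs. */
    interface Parser {
        void parse(Doc doc, StringBuilder out, Context ctx);
    }

    /** Type-keyed lookup table, loosely modeled on the idea of a parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Fallback used when no parser was registered: it emits nothing. */
    static final Parser EMPTY = (doc, out, ctx) -> { };

    /** Emits the document text, lists each embedded file, then delegates it. */
    static final Parser CONTAINER_AWARE = new Parser() {
        @Override
        public void parse(Doc doc, StringBuilder out, Context ctx) {
            out.append(doc.text);
            for (Doc child : doc.children) {
                out.append(child.name).append('\n');
                // Recursion happens only if a parser was put into the context.
                ctx.get(Parser.class, EMPTY).parse(child, out, ctx);
            }
        }
    };

    static String extract(Doc root, boolean registerParser) {
        Context ctx = new Context();
        if (registerParser) {
            ctx.set(Parser.class, CONTAINER_AWARE);
        }
        StringBuilder out = new StringBuilder();
        CONTAINER_AWARE.parse(root, out, ctx);
        return out.toString();
    }

    public static void main(String[] args) {
        Doc docx = new Doc("sign.docx", "text inside docx\n");
        Doc zip = new Doc("sign.zip", "", docx);

        System.out.println("parser not registered:");
        System.out.print(extract(zip, false));
        System.out.println("parser registered:");
        System.out.print(extract(zip, true));
    }
}
```

In this model the unregistered case prints only the embedded file name, while the registered case also prints the embedded file's text.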



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483
 ] 

Tyler Palsulich commented on TIKA-1584:
---------------------------------------

We now have two major issues which need a quick release. So, I would say go for 
1.8. Tim, can you chime in on the current discuss thread?



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385472#comment-14385472
 ] 

Rob Tulloh commented on TIKA-1584:
----------------------------------

Thank you. For what it's worth, it is easy to reproduce. Just zip any document 
you want and then pass the zip file to tika server and see what it gives back. 
As 1.7 is released, does this mean that this won't be fixed until 1.8 or would 
1.7 get re-released/patched?



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385328#comment-14385328
 ] 

Rob Tulloh commented on TIKA-1584:
----------------------------------

If the .zip file is passed to tika, it shows the same behavior.

{noformat}
curl -X PUT -T sign.zip -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null

sign.docx

{noformat}



[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385440#comment-14385440
 ] 

Tim Allison commented on TIKA-1584:
-----------------------------------

Able to attach example triggering doc?  By same behavior do you mean that a 
docx inside a zip is not extracted with -X?
