[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:23 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".
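
For reference, a minimal Java sketch of how these three OIDs could be turned into the byte patterns a magic-based detector would match. The dotted values are the standard assignments for these names; `Pkcs7Oids` and `derHex` are hypothetical names for illustration, not Tika API.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, not part of Tika: DER-encodes dotted OIDs as hex patterns. */
final class Pkcs7Oids {
    static final String SIGNED_DATA = "1.2.840.113549.1.7.2";          // pkcs7-signedData
    static final String ENVELOPED_DATA = "1.2.840.113549.1.7.3";       // pkcs7-envelopedData
    static final String COMPRESSED_DATA = "1.2.840.113549.1.9.16.1.9"; // id-smime-ct-compressedData

    /** Returns tag 0x06, a short-form length and the OID body as lowercase hex. */
    static String derHex(String dotted) {
        String[] arcs = dotted.split("\\.");
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        // The first two arcs share a single byte: 40 * first + second.
        body.write(40 * Integer.parseInt(arcs[0]) + Integer.parseInt(arcs[1]));
        for (int i = 2; i < arcs.length; i++) {
            writeBase128(body, Long.parseLong(arcs[i]));
        }
        byte[] content = body.toByteArray();
        // Assumption: the body is shorter than 128 bytes, so one length byte is enough.
        StringBuilder hex = new StringBuilder(String.format("06%02x", content.length));
        for (byte b : content) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Writes one arc in base 128, high bit set on every byte except the last. */
    private static void writeBase128(ByteArrayOutputStream out, long arc) {
        int groups = 1;
        for (long v = arc >>> 7; v != 0; v >>>= 7) {
            groups++;
        }
        for (int g = groups - 1; g >= 0; g--) {
            int septet = (int) ((arc >>> (7 * g)) & 0x7f);
            out.write(g == 0 ? septet : septet | 0x80);
        }
    }
}
```

In a DER-encoded file the OID follows the outer SEQUENCE header, so a pattern like this would be matched a few bytes into the stream rather than at offset 0.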

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".
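
The mapping described in the last three paragraphs could be written down roughly as follows. This is only an illustrative lookup in Java; `SmimeTypes` and `Kind` are hypothetical names and nothing here is existing Tika code.

```java
/** Hypothetical lookup, not Tika API: media types and extensions as listed in this comment. */
final class SmimeTypes {
    enum Kind { DETACHED_SIGNATURE, SIGNED_DATA, CERTS_ONLY, ENVELOPED_DATA, COMPRESSED_DATA }

    /** Media type, including the smime-type parameter where one applies. */
    static String mediaType(Kind kind) {
        switch (kind) {
            case DETACHED_SIGNATURE: return "application/pkcs7-signature";
            case SIGNED_DATA:        return "application/pkcs7-mime; smime-type=signed-data";
            case CERTS_ONLY:         return "application/pkcs7-mime; smime-type=certs-only";
            case ENVELOPED_DATA:     return "application/pkcs7-mime; smime-type=enveloped-data";
            case COMPRESSED_DATA:    return "application/pkcs7-mime; smime-type=compressed-data";
            default: throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    /** File extension; note that ".p7m" is shared by signed-data and enveloped-data. */
    static String extension(Kind kind) {
        switch (kind) {
            case DETACHED_SIGNATURE: return ".p7s";
            case SIGNED_DATA:        return ".p7m";
            case CERTS_ONLY:         return ".p7c";
            case ENVELOPED_DATA:     return ".p7m";
            case COMPRESSED_DATA:    return ".p7z";
            default: throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }
}
```

Because ".p7m" maps to two different parameter values, the extension alone cannot tell signed-data from enveloped-data; only the OID can.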

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").
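
As an illustration of the textual encoding, here is a small Java sketch that unwraps a PEM block labelled PKCS7 (RFC 7468 boundaries "-----BEGIN PKCS7-----" and "-----END PKCS7-----") into the DER bytes that the OID checks would then run on. `Pkcs7Pem` is a hypothetical helper, not Tika code, and it assumes the whole text fits in memory.

```java
import java.util.Base64;

/** Hypothetical helper, not part of Tika: extracts the DER payload of a PKCS7 PEM block. */
final class Pkcs7Pem {
    private static final String BEGIN = "-----BEGIN PKCS7-----";
    private static final String END = "-----END PKCS7-----";

    /** Returns the decoded bytes, or null when the text contains no PKCS7 block. */
    static byte[] decode(String text) {
        int start = text.indexOf(BEGIN);
        int end = text.indexOf(END);
        if (start < 0 || end < start) {
            return null;
        }
        String base64 = text.substring(start + BEGIN.length(), end);
        // The MIME decoder skips line breaks and other non-alphabet characters.
        return Base64.getMimeDecoder().decode(base64);
    }
}
```

Since the label is the same for every content type, the decoded bytes still have to be inspected to pick the media type.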

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)
 * register "application/pkcs7-signature" as sub-class of 
"application/pkcs7-mime"
 * register "application/x-pkcs7-certificates" as an alias of 
"application/pkcs7-mime"
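
To make the "further inspect streams" item concrete, here is a rough Java sketch of how the three "pkcs7-signedData" cases could be told apart by walking the SignedData structure from RFC 5652. It is not Tika code: `SignedDataInspector` and its methods are hypothetical, it assumes definite-length DER input held in a byte array (indefinite-length BER, which some tools emit, is reported as not recognised), and it checks structure only, not signatures.

```java
import java.util.Arrays;

/** Hypothetical sketch, not part of Tika: classifies a DER-encoded CMS SignedData. */
final class SignedDataInspector {
    enum Kind { DETACHED_SIGNATURE, SIGNED_DATA, CERTS_ONLY, NOT_SIGNED_DATA }

    private static final byte[] SIGNED_DATA_OID = hex("06092a864886f70d010702");

    private final byte[] der;
    private int pos;

    private SignedDataInspector(byte[] der) {
        this.der = der;
    }

    static Kind classify(byte[] der) {
        try {
            return new SignedDataInspector(der).walk();
        } catch (RuntimeException e) {
            return Kind.NOT_SIGNED_DATA; // truncated input, wrong tag or unsupported length
        }
    }

    private Kind walk() {
        enter(0x30);                          // ContentInfo SEQUENCE
        int oidStart = pos;
        skip(0x06);                           // contentType
        if (!Arrays.equals(Arrays.copyOfRange(der, oidStart, pos), SIGNED_DATA_OID)) {
            return Kind.NOT_SIGNED_DATA;
        }
        enter(0xa0);                          // content [0] EXPLICIT
        enter(0x30);                          // SignedData SEQUENCE
        skip(0x02);                           // version
        skip(0x31);                           // digestAlgorithms SET
        int eciLength = enter(0x30);          // encapContentInfo SEQUENCE
        int eciEnd = pos + eciLength;
        skip(0x06);                           // eContentType
        boolean hasContent = pos < eciEnd;    // optional eContent follows the OID
        pos = eciEnd;
        if (peek() == 0xa0) skip(0xa0);       // certificates [0] IMPLICIT, optional
        if (peek() == 0xa1) skip(0xa1);       // crls [1] IMPLICIT, optional
        boolean hasSigners = enter(0x31) > 0; // signerInfos SET
        if (hasContent) return Kind.SIGNED_DATA;
        return hasSigners ? Kind.DETACHED_SIGNATURE : Kind.CERTS_ONLY;
    }

    private int peek() {
        return der[pos] & 0xff;
    }

    /** Consumes a tag and a definite length; returns the length with pos at the value. */
    private int enter(int tag) {
        if (peek() != tag) throw new IllegalStateException("unexpected tag");
        pos++;
        int first = der[pos++] & 0xff;
        if (first < 0x80) return first;
        int count = first & 0x7f;             // long form: number of length bytes
        if (count == 0 || count > 3) throw new IllegalStateException("unsupported length");
        int length = 0;
        for (int i = 0; i < count; i++) {
            length = (length << 8) | (der[pos++] & 0xff);
        }
        return length;
    }

    private void skip(int tag) {
        int length = enter(tag);
        pos += length;
    }

    /** Convenience for building test input from hex strings. */
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

A real detector would work on a buffered stream with a read limit rather than a full byte array, and would map the resulting kind to the media types listed above.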

 



> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit a tika a xml file signed in P7M format, I expect ti

[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:24 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)
 * register "application/pkcs7-signature" as sub-class of 
"application/pkcs7-mime"
 * register "application/x-pkcs7-certificates" as an alias of 
"application/pkcs7-mime; smime-type=certs-only"

 




[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:14 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)
 * register "application/pkcs7-signature" as sub-class of 
"application/pkcs7-mime"

 




[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:14 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)
 * register "application/pkcs7-signature" as sub-class of 
"application/pkcs7-mime"
 * remove "application/x-pkcs7-certificates"

 




[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:06 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)
 * register "application/pkcs7-signature" as sub-class of 
"application/pkcs7-mime" (it is referred to as "degenerated case")

 




[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 11:00 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for any "pkcs7" OID at the beginning of the file and, if found, 
returns "application/pkcs7-signature".

The OIDs that should be looked for are "pkcs7-signedData", 
"pkcs7-envelopedData" and "id-smime-ct-compressedData".

There are three media types with "pkcs7-signedData" at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

When the OID is "pkcs7-envelopedData" the media type is 
"application/pkcs7-mime; smime-type=enveloped-data" and the extension is ".p7m".

When the OID is "id-smime-ct-compressedData" the media type is 
"application/pkcs7-mime; smime-type=compressed-data" and the extension is 
".p7z".

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)

 



> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit an XML file signed in P7M format to Tika, I expect Tika to 
> return the mimetype application/pkcs7-mime, but instead it gives me 
> application/pkcs7-signature.
> How is that possible?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 10:47 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-----BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-----BEGIN PKCS7" or "pkcs7-signedData" are 
found (like it does for XML streams)

 


was (Author: roberto.benedetti):
Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit an XML file signed in P7M format to Tika, I expect it to return 
> the mimetype application/pkcs7-mime, but it gives me 
> application/pkcs7-signature instead.
> How is it possible?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 10:46 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c", when 
there are only certificates and (optionally) CRLs

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 


was (Author: roberto.benedetti):
Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c" (".p7b" 
not mentioned but can be found too), when there are only certificates and 
(optionally) CRLs

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit an XML file signed in P7M format to Tika, I expect it to return 
> the mimetype application/pkcs7-mime, but it gives me 
> application/pkcs7-signature instead.
> How is it possible?





[jira] [Comment Edited] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti edited comment on TIKA-1997 at 1/18/19 10:45 PM:
---

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c" (".p7b" 
not mentioned but can be found too), when there are only certificates and 
(optionally) CRLs

Extension ".p7b" is registered in Tika with media type 
"application/x-pkcs7-certificates" but I think the content of such files is the 
same as ".p7c" ones.

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 


was (Author: roberto.benedetti):
Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c" (".p7b" 
not mentioned but can be found too), when there are only certificates and 
(optionally) CRLs

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit an XML file signed in P7M format to Tika, I expect it to return 
> the mimetype application/pkcs7-mime, but it gives me 
> application/pkcs7-signature instead.
> How is it possible?





[jira] [Commented] (TIKA-1997) Problem in Tika().detect for xml file signed in CADES

2019-01-18 Thread Roberto Benedetti (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746737#comment-16746737
 ] 

Roberto Benedetti commented on TIKA-1997:
-

Updated references are:
 * [RFC-5652, Cryptographic Message Syntax 
(CMS)|https://tools.ietf.org/html/rfc5652]
 * [RFC-5751, Secure/Multipurpose Internet Mail Extensions (S/MIME) Version 3.2 
Message Specification|https://tools.ietf.org/html/rfc5751]
 * [RFC-7468, Textual Encodings of PKIX, PKCS, and CMS 
Structures|https://tools.ietf.org/html/rfc7468]

Tika looks for "pkcs7-signedData" OID at the beginning of the file and, if 
found, returns "application/pkcs7-signature".

There are, however, three media types with that OID at the beginning, namely:
 * "application/pkcs7-signature", extension ".p7s", when the signed content is 
not present (detached signature)
 * "application/pkcs7-mime; smime-type=signed-data", extension ".p7m", when the 
signed content is present
 * "application/pkcs7-mime; smime-type=certs-only", extension ".p7c" (".p7b" 
not mentioned but can be found too), when there are only certificates and 
(optionally) CRLs

Extension ".p7m" is also used when the OID at the beginning is 
"pkcs7-envelopedData" and the media type is "application/pkcs7-mime; 
smime-type=enveloped-data".

Extension ".p7z" is used when the OID at the beginning is 
"id-smime-ct-compressedData" and the media type is "application/pkcs7-mime; 
smime-type=compressed-data".

Furthermore the label in the textual encoding is always PKCS7 (i.e. the file 
begins with "-BEGIN PKCS7").

I can provide examples, built using openssl, but to support those media types 
Tika shall:
 * return parameters in media type when detecting streams
 * return different extensions based on media type parameters
 * further inspect streams when "-BEGIN PKCS7" or pkcs7-signedData are 
found (like it does for XML streams)

 

> Problem in Tika().detect for xml file signed in CADES
> -
>
> Key: TIKA-1997
> URL: https://issues.apache.org/jira/browse/TIKA-1997
> Project: Tika
>  Issue Type: Sub-task
>  Components: detector
>Affects Versions: 1.13
> Environment: JDK 1.7
>Reporter: Michele Andreano
>Priority: Blocker
> Attachments: test.xml.p7m
>
>
> When I submit an XML file signed in P7M format to Tika, I expect it to return 
> the mimetype application/pkcs7-mime, but it gives me 
> application/pkcs7-signature instead.
> How is it possible?





[jira] [Commented] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archive is encrypted

2019-01-18 Thread JIRA


[ 
https://issues.apache.org/jira/browse/TIKA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746515#comment-16746515
 ] 

Pavel Arnošt commented on TIKA-2818:


Hi,

thanks for the response. I attached an "encrypted.rar" sample in which 
"encrypted.txt" is encrypted and "plain.txt" is not.

> RarParser throws EncryptedDocumentException only when whole archive is 
> encrypted
> 
>
> Key: TIKA-2818
> URL: https://issues.apache.org/jira/browse/TIKA-2818
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Pavel Arnošt
>Priority: Minor
> Attachments: encrypted.rar, rar4_encrypted_content_only.rar
>
>
> RarParser throws EncryptedDocumentException only if the whole archive is 
> encrypted. If encryption is applied to individual files, the parser fails with 
> org.apache.tika.exception.TikaException: RarParser Exception:
> Caused by: org.apache.tika.exception.TikaException: RarParser Exception
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:99)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>  at ... 43 more
> Caused by: com.github.junrar.exception.RarException: ioError
>  at com.github.junrar.Archive.getInputStream(Archive.java:525)
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:81)
>  ... 48 more
> Caused by: com.github.junrar.exception.RarException: crcError
>  at com.github.junrar.Archive.doExtractFile(Archive.java:557)
>  at com.github.junrar.Archive.extractFile(Archive.java:498)
>  at com.github.junrar.Archive.getInputStream(Archive.java:523)
>  ... 49 more
> File encryption should be checked before trying to extract content on line 79 
> like this:
> FileHeader header = rar.nextFileHeader();
> // header is null when the archive has no entries, so guard before the check
> if (header != null && header.isEncrypted()) {
>     throw new EncryptedDocumentException();
> }
> while (header != null && !Thread.currentThread().isInterrupted()) {
> Or maybe insert it into metadata with 
> TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM key? I don't know, but 
> current behaviour is not correct (parsing fails).
> Sample document is attached.





[jira] [Updated] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archive is encrypted

2019-01-18 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/TIKA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Arnošt updated TIKA-2818:
---
Attachment: encrypted.rar

> RarParser throws EncryptedDocumentException only when whole archive is 
> encrypted
> 
>
> Key: TIKA-2818
> URL: https://issues.apache.org/jira/browse/TIKA-2818
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Pavel Arnošt
>Priority: Minor
> Attachments: encrypted.rar, rar4_encrypted_content_only.rar
>
>
> RarParser throws EncryptedDocumentException only if the whole archive is 
> encrypted. If encryption is applied to individual files, the parser fails with 
> org.apache.tika.exception.TikaException: RarParser Exception:
> Caused by: org.apache.tika.exception.TikaException: RarParser Exception
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:99)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>  at ... 43 more
> Caused by: com.github.junrar.exception.RarException: ioError
>  at com.github.junrar.Archive.getInputStream(Archive.java:525)
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:81)
>  ... 48 more
> Caused by: com.github.junrar.exception.RarException: crcError
>  at com.github.junrar.Archive.doExtractFile(Archive.java:557)
>  at com.github.junrar.Archive.extractFile(Archive.java:498)
>  at com.github.junrar.Archive.getInputStream(Archive.java:523)
>  ... 49 more
> File encryption should be checked before trying to extract content on line 79 
> like this:
> FileHeader header = rar.nextFileHeader();
> // header is null when the archive has no entries, so guard before the check
> if (header != null && header.isEncrypted()) {
>     throw new EncryptedDocumentException();
> }
> while (header != null && !Thread.currentThread().isInterrupted()) {
> Or maybe insert it into metadata with 
> TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM key? I don't know, but 
> current behaviour is not correct (parsing fails).
> Sample document is attached.





[jira] [Commented] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archive is encrypted

2019-01-18 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746490#comment-16746490
 ] 

Tim Allison commented on TIKA-2818:
---

Sorry, I should have started with: Thank you for raising this issue and sharing 
an example file.

Is there any chance you could share a sample file where the first file is 
encrypted but the second file is not?

Thank you, again!

> RarParser throws EncryptedDocumentException only when whole archive is 
> encrypted
> 
>
> Key: TIKA-2818
> URL: https://issues.apache.org/jira/browse/TIKA-2818
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Pavel Arnošt
>Priority: Minor
> Attachments: rar4_encrypted_content_only.rar
>
>
> RarParser throws EncryptedDocumentException only if the whole archive is 
> encrypted. If encryption is applied to individual files, the parser fails with 
> org.apache.tika.exception.TikaException: RarParser Exception:
> Caused by: org.apache.tika.exception.TikaException: RarParser Exception
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:99)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>  at ... 43 more
> Caused by: com.github.junrar.exception.RarException: ioError
>  at com.github.junrar.Archive.getInputStream(Archive.java:525)
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:81)
>  ... 48 more
> Caused by: com.github.junrar.exception.RarException: crcError
>  at com.github.junrar.Archive.doExtractFile(Archive.java:557)
>  at com.github.junrar.Archive.extractFile(Archive.java:498)
>  at com.github.junrar.Archive.getInputStream(Archive.java:523)
>  ... 49 more
> File encryption should be checked before trying to extract content on line 79 
> like this:
> FileHeader header = rar.nextFileHeader();
> // header is null when the archive has no entries, so guard before the check
> if (header != null && header.isEncrypted()) {
>     throw new EncryptedDocumentException();
> }
> while (header != null && !Thread.currentThread().isInterrupted()) {
> Or maybe insert it into metadata with 
> TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM key? I don't know, but 
> current behaviour is not correct (parsing fails).
> Sample document is attached.





[jira] [Commented] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archive is encrypted

2019-01-18 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746482#comment-16746482
 ] 

Tim Allison commented on TIKA-2818:
---

Something like this from the RecursiveParserWrapper?

{noformat}
0: X-Parsed-By : org.apache.tika.parser.DefaultParser
0: X-Parsed-By : org.apache.tika.parser.pkg.RarParser
0: X-TIKA:content_handler : ToXMLContentHandler
0: X-TIKA:parse_time_millis : 195
0: X-TIKA:content : http://www.w3.org/1999/xhtml";>






 

encrypted.txt
0: Content-Type : application/x-rar-compressed
1: embeddedRelationshipId : encrypted.txt
1: X-TIKA:EXCEPTION:embedded_exception : 
org.apache.tika.exception.EncryptedDocumentException: Unable to process: 
document is encrypted
at 
org.apache.tika.parser.pkg.RarParser$EncryptedDocumentExceptionInputStream.read(RarParser.java:119)
at java.io.InputStream.read(InputStream.java:170)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
at org.apache.tika.io.TikaInputStream.peek(TikaInputStream.java:572)
at 
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:149)
at 
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:147)
at 
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:370)
at 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:105)
at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:90)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:263)
at org.apache.tika.TikaTest.getRecursiveMetadata(TikaTest.java:219)
at 
org.apache.tika.parser.pkg.RarParserTest.testSingleEncryptedRar(RarParserTest.java:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

1: meta:save-date : 2019-01-18T10:17:30Z
1: X-TIKA:EXCEPTION:embedded_parser : org.apache.tika.parser.AutoDetectParser
1: X-TIKA:parse_time_millis : 5
1: resourceName : encrypted.

[jira] [Commented] (TIKA-2817) Tika doesn't respect gzip filename

2019-01-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746479#comment-16746479
 ] 

ASF GitHub Bot commented on TIKA-2817:
--

tombrisland commented on pull request #265: fix for TIKA-2817 contributed by 
tombrisland
URL: https://github.com/apache/tika/pull/265
 
 
   If the resource name isn't populated for a gzip, then CompressorParser will 
now extract the decompressed filename from gzip headers and store it in entry 
metadata.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Tika doesn't respect gzip filename
> --
>
> Key: TIKA-2817
> URL: https://issues.apache.org/jira/browse/TIKA-2817
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tom Brisland
>Priority: Trivial
>
> Gzip allows for the original filename to be stored in a header on compression 
> then reinstated on decompression.
> At present Tika doesn't read these headers so filename is not stored in the 
> metadata as it should be.
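
As a rough illustration of what reading that header involves, the sketch below pulls the original file name out of a gzip member header following the RFC 1952 layout. It is not the code from the pull request: the class name GzipHeaderName and its methods are hypothetical, it uses only the JDK, and it ignores everything after the name field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class GzipHeaderName {
    private static final int FEXTRA = 0x04; // optional extra field present
    private static final int FNAME = 0x08;  // original file name present

    /** Returns the original file name stored in the gzip header, or null if absent. */
    static String readOriginalName(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readUnsignedByte() != 0x1f || data.readUnsignedByte() != 0x8b) {
            throw new IOException("not a gzip stream");
        }
        data.readUnsignedByte();             // compression method (8 = deflate)
        int flags = data.readUnsignedByte(); // FLG byte
        data.readFully(new byte[6]);         // MTIME (4 bytes), XFL (1), OS (1)
        if ((flags & FEXTRA) != 0) {
            // Two-byte little-endian length, then that many bytes to skip.
            int len = data.readUnsignedByte() | (data.readUnsignedByte() << 8);
            data.readFully(new byte[len]);
        }
        if ((flags & FNAME) == 0) {
            return null;
        }
        // The name is a zero-terminated ISO 8859-1 string.
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        for (int b = data.readUnsignedByte(); b != 0; b = data.readUnsignedByte()) {
            name.write(b);
        }
        return new String(name.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Convenience overload for in-memory data; wraps I/O errors unchecked. */
    static String readOriginalName(byte[] gzipBytes) {
        try {
            return readOriginalName(new ByteArrayInputStream(gzipBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A parser could fall back to a name obtained this way only when no resource name was supplied by the caller, which is the case the issue is about.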





[jira] [Created] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archiveis encrypted

2019-01-18 Thread JIRA
Pavel Arnošt created TIKA-2818:
--

 Summary: RarParser throws EncryptedDocumentException only when 
whole archiveis encrypted
 Key: TIKA-2818
 URL: https://issues.apache.org/jira/browse/TIKA-2818
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.20
Reporter: Pavel Arnošt
 Attachments: rar4_encrypted_content_only.rar

RarParser throws EncryptedDocumentException only if the whole archive is 
encrypted. If encryption is applied to individual files, the parser fails with 
org.apache.tika.exception.TikaException: RarParser Exception:

Caused by: org.apache.tika.exception.TikaException: RarParser Exception
 at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:99)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
 at ... 43 more
Caused by: com.github.junrar.exception.RarException: ioError
 at com.github.junrar.Archive.getInputStream(Archive.java:525)
 at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:81)
 ... 48 more
Caused by: com.github.junrar.exception.RarException: crcError
 at com.github.junrar.Archive.doExtractFile(Archive.java:557)
 at com.github.junrar.Archive.extractFile(Archive.java:498)
 at com.github.junrar.Archive.getInputStream(Archive.java:523)
 ... 49 more

File encryption should be checked before trying to extract content on line 79 
like this:

FileHeader header = rar.nextFileHeader();

// header is null when the archive has no entries, so guard before the check
if (header != null && header.isEncrypted()) {
    throw new EncryptedDocumentException();
}

while (header != null && !Thread.currentThread().isInterrupted()) {

Or maybe insert it into metadata with 
TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM key? I don't know, but 
current behaviour is not correct (parsing fails).

Sample document is attached.
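
The second option mentioned above (note the encrypted entry and carry on) could look roughly like the following. This is a sketch under assumptions, not RarParser code: Entry stands in for the junrar file header, Outcome and process are hypothetical names, and "parsing" is reduced to collecting entry names so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

final class EncryptedEntrySketch {
    /** Hypothetical stand-in for an archive entry header. */
    static final class Entry {
        final String name;
        final boolean encrypted;

        Entry(String name, boolean encrypted) {
            this.name = name;
            this.encrypted = encrypted;
        }
    }

    /** Names of entries that were handled and of those skipped as encrypted. */
    static final class Outcome {
        final List<String> parsed = new ArrayList<>();
        final List<String> skippedEncrypted = new ArrayList<>();
    }

    /**
     * Checks each entry before extraction, so one encrypted file does not
     * make the whole archive fail.
     */
    static Outcome process(List<Entry> entries) {
        Outcome outcome = new Outcome();
        for (Entry entry : entries) {
            if (entry == null) {
                continue;
            }
            if (entry.encrypted) {
                // A real parser would record the exception in the entry's metadata here.
                outcome.skippedEncrypted.add(entry.name);
                continue;
            }
            outcome.parsed.add(entry.name);
        }
        return outcome;
    }
}
```

With an archive holding one encrypted and one plain file, the plain file would still be processed and the encrypted one would only be reported.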





[jira] [Updated] (TIKA-2818) RarParser throws EncryptedDocumentException only when whole archive is encrypted

2019-01-18 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/TIKA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Arnošt updated TIKA-2818:
---
Summary: RarParser throws EncryptedDocumentException only when whole 
archive is encrypted  (was: RarParser throws EncryptedDocumentException only 
when whole archiveis encrypted)

> RarParser throws EncryptedDocumentException only when whole archive is 
> encrypted
> 
>
> Key: TIKA-2818
> URL: https://issues.apache.org/jira/browse/TIKA-2818
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.20
>Reporter: Pavel Arnošt
>Priority: Minor
> Attachments: rar4_encrypted_content_only.rar
>
>
> RarParser throws EncryptedDocumentException only if the whole archive is 
> encrypted. If encryption is applied to individual files, the parser fails with 
> org.apache.tika.exception.TikaException: RarParser Exception:
> Caused by: org.apache.tika.exception.TikaException: RarParser Exception
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:99)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
>  at ... 43 more
> Caused by: com.github.junrar.exception.RarException: ioError
>  at com.github.junrar.Archive.getInputStream(Archive.java:525)
>  at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:81)
>  ... 48 more
> Caused by: com.github.junrar.exception.RarException: crcError
>  at com.github.junrar.Archive.doExtractFile(Archive.java:557)
>  at com.github.junrar.Archive.extractFile(Archive.java:498)
>  at com.github.junrar.Archive.getInputStream(Archive.java:523)
>  ... 49 more
> File encryption should be checked before trying to extract content on line 79 
> like this:
> FileHeader header = rar.nextFileHeader();
> // header is null when the archive has no entries, so guard before the check
> if (header != null && header.isEncrypted()) {
>     throw new EncryptedDocumentException();
> }
> while (header != null && !Thread.currentThread().isInterrupted()) {
> Or maybe insert it into metadata with 
> TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM key? I don't know, but 
> current behaviour is not correct (parsing fails).
> Sample document is attached.





[jira] [Commented] (TIKA-2817) Tika doesn't respect gzip filename

2019-01-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745913#comment-16745913
 ] 

ASF GitHub Bot commented on TIKA-2817:
--

tombrisland commented on pull request #264: fix for TIKA-2817 contributed by 
tombrisland
URL: https://github.com/apache/tika/pull/264
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Tika doesn't respect gzip filename
> --
>
> Key: TIKA-2817
> URL: https://issues.apache.org/jira/browse/TIKA-2817
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tom Brisland
>Priority: Trivial
>
> Gzip allows for the original filename to be stored in a header on compression 
> then reinstated on decompression.
> At present Tika doesn't read these headers so filename is not stored in the 
> metadata as it should be.


