[ https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yury Kats updated TIKA-2685:
----------------------------
Description:
I have a number of email messages that are reports of undeliverable emails and
contain the original email message as an attachment.
The original emails are parts with "Content-Type: message/rfc822" but are not
being recognized as such.
Attached is an example email:
* Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
I would like to see 2 separate emails parsed out (top level undeliverable
report, 1st level attached original email), but I get 1 email and 2 unnamed
text attachments:
{noformat}
$ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m json.tool
[
{
"Author": "[email protected]",
"Content-Length": "17356",
"Content-Type": "message/rfc822",
"Creation-Date": "2017-11-04T16:00:11Z",
"Message-From": "[email protected]",
"Message-To": "[email protected]",
"Message:From-Email": "[email protected]",
"Message:Raw-Header:Auto-Submitted": "auto-generated",
"Message:Raw-Header:MIME-Version": "1.0",
"Message:Raw-Header:Message-ID": "<[email protected]>",
"Message:Raw-Header:Return-Path": "<>",
"Message:Raw-Header:Sender": "<[email protected]>",
"Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
"Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
"Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<[email protected]>",
"Message:Raw-Header:X-MS-Journal-Report": "",
"Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.mail.RFC822Parser"
],
"X-TIKA:parse_time_millis": "326",
"creator": "[email protected]",
"dc:creator": "[email protected]",
"dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
"dcterms:created": "2017-11-04T16:00:11Z",
"meta:author": "[email protected]",
"meta:creation-date": "2017-11-04T16:00:11Z",
"resourceName": "undeliverable.eml",
"subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
},
{
"Content-Encoding": "windows-1252",
"Content-Type": "text/plain; charset=windows-1252",
"Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
"Multipart-Subtype": "report",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
"X-TIKA:embedded_resource_path": "/embedded-1",
"X-TIKA:parse_time_millis": "4",
"embeddedResourceType": "ATTACHMENT"
},
{
"Content-Encoding": "US-ASCII",
"Content-Type": "text/html; charset=US-ASCII",
"Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
"Multipart-Subtype": "report",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
"X-TIKA:embedded_resource_path": "/embedded-2",
"X-TIKA:parse_time_millis": "7",
"embeddedResourceType": "ATTACHMENT"
}
]
{noformat}
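For comparison, here is a rough sketch of the structure I expect to be found, using only Python's standard `email` package rather than Tika. The helper names `attached_message_subjects` and `build_sample`, the example.com addresses, and the shortened subjects are made up for illustration. The synthetic message is also simpler than the attached undeliverable.eml, which nests a multipart/report inside multipart/mixed.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def attached_message_subjects(raw: bytes) -> list[str]:
    """Return the Subject of every message/rfc822 part nested in raw."""
    root = message_from_bytes(raw, policy=policy.default)
    subjects = []
    for part in root.walk():
        if part.get_content_type() != "message/rfc822":
            continue
        # A parsed message/rfc822 part holds the nested message as the
        # single element of its payload list.
        inner = part.get_payload(0)
        subjects.append(str(inner["Subject"]))
    return subjects


def build_sample() -> bytes:
    """Build a synthetic report that carries the original email as an attachment."""
    original = EmailMessage()
    original["Subject"] = "SRE Agent Out of Space"
    original["From"] = "sender@example.com"
    original["To"] = "rcpt@example.com"
    original.set_content("disk is full")

    report = EmailMessage()
    report["Subject"] = "Undeliverable: SRE Agent Out of Space"
    report["From"] = "postmaster@example.com"
    report["To"] = "sender@example.com"
    report.set_content("Delivery has failed.")
    # Attaching a message object yields a message/rfc822 part.
    report.add_attachment(original)
    return report.as_bytes()


if __name__ == "__main__":
    print(attached_message_subjects(build_sample()))
```

The same function could be pointed at the bytes of a real .eml file to check whether it contains message/rfc822 parts at all, independently of what Tika reports.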
was:
I have a number of email messages that are reported of deliverable emails that
contain the opriginal email message as attachment.
The original emails are parts with "Content-Type: message/rfc822" but are not
being recognized as such.
Attached is an example email:
* Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
I would like to see 2 separate emails parsed out (top level undeliverable
report, 1st level attached original email), but I get 1 email and 2 unnamed
text attachments:
{noformat}
$ java -jar tika-app-1.18.jar -m -J /tmp/undeliverable.eml | python -m json.tool
[
{
"Author": "[email protected]",
"Content-Length": "17356",
"Content-Type": "message/rfc822",
"Creation-Date": "2017-11-04T16:00:11Z",
"Message-From": "[email protected]",
"Message-To": "[email protected]",
"Message:From-Email": "[email protected]",
"Message:Raw-Header:Auto-Submitted": "auto-generated",
"Message:Raw-Header:MIME-Version": "1.0",
"Message:Raw-Header:Message-ID": "<[email protected]>",
"Message:Raw-Header:Return-Path": "<>",
"Message:Raw-Header:Sender": "<[email protected]>",
"Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
"Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
"Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<[email protected]>",
"Message:Raw-Header:X-MS-Journal-Report": "",
"Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.mail.RFC822Parser"
],
"X-TIKA:parse_time_millis": "326",
"creator": "[email protected]",
"dc:creator": "[email protected]",
"dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
"dcterms:created": "2017-11-04T16:00:11Z",
"meta:author": "[email protected]",
"meta:creation-date": "2017-11-04T16:00:11Z",
"resourceName": "undeliverable.eml",
"subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
},
{
"Content-Encoding": "windows-1252",
"Content-Type": "text/plain; charset=windows-1252",
"Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
"Multipart-Subtype": "report",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
"X-TIKA:embedded_resource_path": "/embedded-1",
"X-TIKA:parse_time_millis": "4",
"embeddedResourceType": "ATTACHMENT"
},
{
"Content-Encoding": "US-ASCII",
"Content-Type": "text/html; charset=US-ASCII",
"Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
"Multipart-Subtype": "report",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"
],
"X-TIKA:embedded_resource_path": "/embedded-2",
"X-TIKA:parse_time_millis": "7",
"embeddedResourceType": "ATTACHMENT"
}
]
{noformat}
> Email attached to an undeliverable email report are not extracted
> -----------------------------------------------------------------
>
> Key: TIKA-2685
> URL: https://issues.apache.org/jira/browse/TIKA-2685
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Priority: Major
> Attachments: undeliverable.eml
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)