Yury Kats created TIKA-2680:
-------------------------------

             Summary: Email attachments to an email are not extracted
                 Key: TIKA-2680
                 URL: https://issues.apache.org/jira/browse/TIKA-2680
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.18
            Reporter: Yury Kats
         Attachments: nested.eml

I have a number of email messages that contain other email messages as 
attachments (with multiple levels of nesting).

The email attachments are parts with "Content-Type: message/rfc822" but are not 
being recognized as such.

Attached is an example email with multiple levels of attachments:
 * Subject: Test email within email
 ** Subject: Email within email test
 *** Subject: Stand-up today

I would like to see 3 separate emails parsed out (top level, 1st level attached 
email, 2nd level attached email), but I only get 1 email and 1 unnamed text 
attachment:
{noformat}
$ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
[
{
"Author": "Hoeven Van der, H (Hedde) <[email protected]>",
"Content-Length": "17237",
"Content-Type": "message/rfc822",
"Creation-Date": "2018-04-25T12:46:41Z",
"Message-From": "Hoeven Van der, H (Hedde) <[email protected]>",
"Message-To": [
"fm.uk.london.SAN Management Team <[email protected]>",
"Hoeven Van der, H (Hedde) <[email protected]>"
],
"Message:From-Email": "[email protected]",
"Message:From-Name": "Hoeven Van der, H (Hedde)",
"Message:Raw-Header:Auto-Submitted": "auto-generated",
"Message:Raw-Header:Content-Transfer-Encoding": "binary",
"Message:Raw-Header:Keywords": "",
"Message:Raw-Header:MIME-Version": "1.0",
"Message:Raw-Header:Message-ID": "<[email protected]>",
"Message:Raw-Header:Return-Path": "<>",
"Message:Raw-Header:Sender": "<[email protected]>",
"Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
"Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<[email protected]>",
"Message:Raw-Header:X-MS-Journal-Report": "",
"Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.mail.RFC822Parser"
],
"X-TIKA:parse_time_millis": "326",
"creator": "Hoeven Van der, H (Hedde) <[email protected]>",
"dc:creator": "Hoeven Van der, H (Hedde) <[email protected]>",
"dc:title": "Test email within email",
"dcterms:created": "2018-04-25T12:46:41Z",
"meta:author": "Hoeven Van der, H (Hedde) <[email protected]>",
"meta:creation-date": "2018-04-25T12:46:41Z",
"resourceName": "nested.eml",
"subject": "Test email within email"
},
{
"Content-Encoding": "US-ASCII",
"Content-Type": "text/plain; charset=US-ASCII",
"Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"
],
"X-TIKA:embedded_resource_path": "/embedded-1",
"X-TIKA:parse_time_millis": "5",
"embeddedResourceType": "ATTACHMENT"
}
]
{noformat}
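
As a cross-check that is independent of Tika, the nesting can be listed with Python's standard email package. This is only a minimal sketch, not how Tika parses mail: the function names collect_subjects and subjects_in_file are made up for illustration, and it assumes the message/rfc822 parts are stored unencoded (7bit, 8bit, or binary) so that the parser can descend into them.

```python
import email
from email.message import Message


def collect_subjects(msg: Message, depth: int = 0, out=None):
    """Return (depth, subject) for msg and every email nested inside it.

    Hypothetical helper: depth 0 is the top-level email, depth 1 an email
    attached to it, and so on.
    """
    if out is None:
        out = []
    out.append((depth, msg["Subject"]))
    _descend(msg, depth, out)
    return out


def _descend(part: Message, depth: int, out: list) -> None:
    # Leaf parts (text, binary attachments) carry no nested emails.
    if not part.is_multipart():
        return
    for child in part.get_payload():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is the attached email.
            collect_subjects(child, depth + 1, out)
        else:
            # Ordinary multipart container: keep looking at the same depth.
            _descend(child, depth, out)


def subjects_in_file(path: str):
    """Parse an .eml file and list the subjects of all nested emails."""
    with open(path, "rb") as fh:
        return collect_subjects(email.message_from_binary_file(fh))
```

Calling subjects_in_file("nested.eml") and printing each subject indented by its depth should list one line per email; if the nesting is as described above, that would be three entries at depths 0, 1 and 2.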



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
