[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528296#comment-16528296
 ] 

Yury Kats edited comment on TIKA-2680 at 6/29/18 9:54 PM:
----------------------------------------------------------

This appears to happen because mime4j (correctly) treats an attached email as 
a "new message" rather than as a "part" of the original email.

As a result, MailContentHandler#body is not called for the attached email. 
MailContentHandler#startMessage is called instead, so MailContentHandler does 
no recursive parsing or extraction, and the parts of the nested message are 
treated as parts of the original.

This is somewhat complicated by the fact that not all nested messages are 
attachments. When they are not, the current behavior of MailContentHandler is 
fine. 

I think one way to fix this would be for mime4j to distinguish message/rfc822 
parts that are attachments (those with a "Content-Disposition: attachment" 
header) from those that are not, and to notify the ContentHandler differently 
in each case, so that MailContentHandler can choose an appropriate action.
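
To illustrate the distinction, here is a minimal sketch that uses Python's 
standard email package rather than mime4j or Tika. The sample message and the 
classify_rfc822_parts helper are made up for this example and are not part of 
either library. It only shows that the Content-Disposition header is enough to 
tell an attached message/rfc822 part from a non-attached one.

```python
from email import message_from_string

# Hypothetical sample: one attached and one non-attached nested message.
SAMPLE = """\
Subject: outer
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="B"

--B
Content-Type: message/rfc822
Content-Disposition: attachment

Subject: attached
Content-Type: text/plain

attached body
--B
Content-Type: message/rfc822

Subject: inline
Content-Type: text/plain

inline body
--B--
"""


def classify_rfc822_parts(msg):
    """Return (subject, is_attachment) for every message/rfc822 part."""
    results = []
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The parser stores the nested message as the part's only payload item.
            nested = part.get_payload(0)
            is_attachment = part.get_content_disposition() == "attachment"
            results.append((nested["Subject"], is_attachment))
    return results


for subject, is_attachment in classify_rfc822_parts(message_from_string(SAMPLE)):
    print(subject, "attachment" if is_attachment else "not an attachment")
```

A handler that received this flag could recurse into the attached message as a 
separate embedded document and keep the current behavior for the other case.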

 



> Email attachments to an email are not extracted
> -----------------------------------------------
>
>                 Key: TIKA-2680
>                 URL: https://issues.apache.org/jira/browse/TIKA-2680
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Priority: Major
>         Attachments: nested.eml
>
>
> I have a number of email messages that contain other email messages as 
> attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are 
> not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level 
> attached email, 2nd level attached email), but I only get 1 email and 1 
> unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) <[email protected]>",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) <[email protected]>",
>         "Message-To": [
>             "fm.SAN Management Team <[email protected]>",
>             "Smith Van der, H (Henry) <[email protected]>"
>         ],
>         "Message:From-Email": "[email protected]",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "<[email protected]>",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "<[email protected]>",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<[email protected]>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "325",
>         "creator": "Smith Van der, H (Henry) <[email protected]>",
>         "dc:creator": "Smith Van der, H (Henry) <[email protected]>",
>         "dc:title": "Test email within email",
>         "dcterms:created": "2018-04-25T12:46:41Z",
>         "meta:author": "Smith Van der, H (Henry) <[email protected]>",
>         "meta:creation-date": "2018-04-25T12:46:41Z",
>         "resourceName": "nested.eml",
>         "subject": "Test email within email"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/plain; charset=US-ASCII",
>         "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "5",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}
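
The result the reporter expects (three separate messages, one per nesting 
level) can be sketched with Python's standard email package. This is only an 
illustration of recursive extraction, not Tika's or mime4j's implementation. 
The build_nested and extract_messages helpers are hypothetical, and the test 
message is constructed in memory from the three subjects listed in the issue 
instead of being read from nested.eml.

```python
from email.message import EmailMessage


def build_nested(subjects):
    """Build a message for subjects[0] that carries the rest as nested attachments."""
    msg = EmailMessage()
    msg["Subject"] = subjects[0]
    msg.set_content("body of " + subjects[0])
    if len(subjects) > 1:
        # Attaching a message object produces a message/rfc822 part
        # with "Content-Disposition: attachment".
        msg.add_attachment(build_nested(subjects[1:]))
    return msg


def extract_messages(msg, depth=0):
    """Return [(depth, subject)] for msg and every message/rfc822 part inside it."""
    found = [(depth, str(msg["Subject"]))]
    pending = list(msg.get_payload()) if msg.is_multipart() else []
    while pending:
        part = pending.pop(0)
        if part.get_content_type() == "message/rfc822":
            # Recurse into the nested message instead of flattening its parts
            # into the parent.
            found.extend(extract_messages(part.get_payload(0), depth + 1))
        elif part.is_multipart():
            # Plain multipart containers are searched at the same depth.
            pending.extend(part.get_payload())
    return found


SUBJECTS = [
    "Test email within email",
    "Email within email test",
    "Stand-up today",
]

for depth, subject in extract_messages(build_nested(SUBJECTS)):
    print("  " * depth + subject)
```

With this input the helper reports the three subjects at depths 0, 1 and 2, 
which matches the structure listed in the issue description.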


