[ 
https://issues.apache.org/jira/browse/TIKA-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535135#comment-16535135
 ] 

Yury Kats commented on TIKA-2685:
---------------------------------

For my own immediate needs, I modified MimeStreamParser to call 
ContentHandler#startMessage with the stream, i.e. instead of
{code}
handler.startMessage()
{code}
I call
{code}
handler.startMessage(mimeTokenStream.getInputStream())
{code}
I then added a startMessage(InputStream is) method to MailContentHandler that 
checks whether the message is inside a "multipart/report" part and, if so, 
invokes handleEmbedded:
{code}
    public void startMessage(InputStream is) throws MimeException {
        // A message nested directly inside multipart/report is treated as an attachment.
        boolean attachedMessage = parts.size() > 0
                && parts.peek().getMimeType().equals("multipart/report");
        if (!attachedMessage) {
            // Not inside a report: keep the existing behavior.
            startMessage();
        } else {
            // Hand the nested message to the embedded-document machinery as message/rfc822.
            Metadata submd = new Metadata();
            submd.set(Metadata.CONTENT_TYPE, "message/rfc822");
            submd.set(Metadata.CONTENT_DISPOSITION, "attachment");
            try (TikaInputStream tis = TikaInputStream.get(is)) {
                handleEmbedded(tis, submd);
            } catch (IOException e) {
                throw new MimeException(e);
            }
        }
    }
{code}
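
To illustrate the routing decision on its own, here is a minimal standalone sketch. It does not use the Tika or mime4j classes: the ReportRouting class and the isAttachedMessage helper are hypothetical names, and the deque of MIME type strings only stands in for the parts stack, on the assumption that it holds the enclosing parts with the innermost one on top.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Standalone model of the check made in startMessage(InputStream).
// Hypothetical names throughout; this is not Tika or mime4j code.
class ReportRouting {

    // Returns true when the innermost enclosing part is multipart/report,
    // i.e. when the nested message would be routed to handleEmbedded.
    static boolean isAttachedMessage(Deque<String> enclosingMimeTypes) {
        // peek() returns null on an empty deque, and the constant-first
        // equals() is null-safe, so no separate size check is needed.
        return "multipart/report".equals(enclosingMimeTypes.peek());
    }

    public static void main(String[] args) {
        Deque<String> parts = new ArrayDeque<>();
        // Top-level message: nothing encloses it.
        System.out.println(isAttachedMessage(parts));

        // Inside multipart/mixed only: still handled the old way.
        parts.push("multipart/mixed");
        System.out.println(isAttachedMessage(parts));

        // Inside multipart/report: treated as an attached message.
        parts.push("multipart/report");
        System.out.println(isAttachedMessage(parts));
    }
}
```

Only the innermost enclosing part is inspected, which matches the parts.peek() call in the change above; a message nested more deeply under a report would need a different check.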

I've made a similar change for TIKA-2680, only there I detect that the message 
is an attachment in MimeEntity#advance, and then end up calling the new 
startMessage with the stream and then handleEmbedded.

Not sure if these are the best ways of solving these issues though. Looking 
forward to your take on them.

> Email attached to an undeliverable email report are not extracted
> -----------------------------------------------------------------
>
>                 Key: TIKA-2685
>                 URL: https://issues.apache.org/jira/browse/TIKA-2685
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: undeliverable.eml
>
>
> I have a number of email messages that are reports of undeliverable emails that 
> contain the original email message as an attachment.
> The original emails are parts with "Content-Type: message/rfc822" but are not 
> being recognized as such.
> Attached is an example email:
>  * Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
>  ** Subject: Subject: SRE Agent Out of Space Source:WindowsApp
>  
> I would like to see 2 separate emails parsed out (top level undeliverable 
> report, 1st level attached original email), but I get 1 email and 2 unnamed 
> text attachments:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J  /tmp/undeliverable.eml | python -m json.tool
> [
>     {
>         "Author": "postmas...@bank.com",
>         "Content-Length": "17356",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2017-11-04T16:00:11Z",
>         "Message-From": "postmas...@bank.com",
>         "Message-To": "uatalert...@logscape.com",
>         "Message:From-Email": "postmas...@bank.com",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "<microsoftexchange329e71ec88ae4615bbc36ab6ce411...@bank.com>",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<1451b918-770a-4d83-b1f9-0c9c0668f...@bxts124020.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "326",
>         "creator": "postmas...@bank.com",
>         "dc:creator": "postmas...@bank.com",
>         "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
>         "dcterms:created": "2017-11-04T16:00:11Z",
>         "meta:author": "postmas...@bank.com",
>         "meta:creation-date": "2017-11-04T16:00:11Z",
>         "resourceName": "undeliverable.eml",
>         "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
>     },
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Type": "text/plain; charset=windows-1252",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "4",
>         "embeddedResourceType": "ATTACHMENT"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/html; charset=US-ASCII",
>         "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
>         "Multipart-Subtype": "report",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.html.HtmlParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-2",
>         "X-TIKA:parse_time_millis": "7",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}



