[ https://issues.apache.org/jira/browse/TIKA-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905303#comment-16905303 ]

Joshua Turner commented on TIKA-2921:
-------------------------------------

Local testing with the following as a replacement for the first part of 
MailContentHandler.handleInlineBodyPart() yields the content I expect from the 
affected mail items:

{code:java}
    private void handleInlineBodyPart(BodyContents part) throws MimeException, IOException {
        String contentType = part.metadata.get(Metadata.CONTENT_TYPE);
        Parser parser = null;
        boolean inlineText = false;
        if (MediaType.TEXT_HTML.toString().equalsIgnoreCase(contentType)) {
            parser = EmbeddedDocumentUtil.tryToFindExistingLeafParser(HtmlParser.class, parseContext);
            if (parser == null) {
                // No existing HtmlParser: create one and register it instead of giving up.
                parser = new HtmlParser();
                parseContext.set(Parser.class, parser);
            }
        } else if ("application/rtf".equalsIgnoreCase(contentType)) {
            parser = EmbeddedDocumentUtil.tryToFindExistingLeafParser(RTFParser.class, parseContext);
            if (parser == null) {
                // Same fallback for RTF bodies.
                parser = new RTFParser();
                parseContext.set(Parser.class, parser);
            }
        } else if (MediaType.TEXT_PLAIN.toString().equalsIgnoreCase(contentType)) {
            parser = EmbeddedDocumentUtil.tryToFindExistingLeafParser(TXTParser.class, parseContext);
            if (parser == null) {
                // No TXTParser: try TextAndCSVParser, creating one if it is missing too.
                parser = EmbeddedDocumentUtil.tryToFindExistingLeafParser(TextAndCSVParser.class, parseContext);
                if (parser == null) {
                    parser = new TextAndCSVParser();
                    parseContext.set(Parser.class, parser);
                }
                inlineText = true;
            }
        }
        // ... rest of the method as in the existing implementation

{code}
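
The three branches above repeat the same find-or-create step, so it could be factored into one helper. Below is a minimal, self-contained sketch of that pattern only. Everything in it (FindOrCreateSketch, Registry, LeafParser, HtmlLeaf, TextLeaf, findOrCreate) is a hypothetical stand-in written for illustration, not Tika's API; in particular, the real EmbeddedDocumentUtil.tryToFindExistingLeafParser() does more than check a single slot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class FindOrCreateSketch {

    /** Stand-in for a parser type; NOT org.apache.tika.parser.Parser. */
    interface LeafParser {
        String name();
    }

    /** Stand-in for a parse context: a simple type-keyed registry (hypothetical). */
    static final class Registry {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    static final class HtmlLeaf implements LeafParser {
        public String name() { return "html"; }
    }

    static final class TextLeaf implements LeafParser {
        public String name() { return "text"; }
    }

    /**
     * Returns the registered parser if it is of the wanted type; otherwise
     * creates one with the factory, registers it, and returns it.
     */
    static <T extends LeafParser> T findOrCreate(Class<T> wanted, Supplier<T> factory, Registry ctx) {
        LeafParser existing = ctx.get(LeafParser.class);
        if (wanted.isInstance(existing)) {
            return wanted.cast(existing);
        }
        T created = factory.get();
        // Overwrites whatever was registered before, as set(Parser.class, ...) does above.
        ctx.set(LeafParser.class, created);
        return created;
    }

    public static void main(String[] args) {
        Registry ctx = new Registry();
        HtmlLeaf first = findOrCreate(HtmlLeaf.class, HtmlLeaf::new, ctx);
        HtmlLeaf second = findOrCreate(HtmlLeaf.class, HtmlLeaf::new, ctx);
        System.out.println(first == second); // prints "true": the second call reuses the instance
        TextLeaf text = findOrCreate(TextLeaf.class, TextLeaf::new, ctx);
        System.out.println(text.name());     // prints "text"
    }
}
```

One thing the sketch makes visible: registering the newly created parser replaces whatever was in that slot before. Whether that is acceptable for the real parse context would need checking against how the rest of the mail parser uses it.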


> Tika discarding bodies of inline MIME elements in RFC822 email
> --------------------------------------------------------------
>
>                 Key: TIKA-2921
>                 URL: https://issues.apache.org/jira/browse/TIKA-2921
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
>            Reporter: Joshua Turner
>            Priority: Major
>
> Given an RFC822 email that has two inline body parts (such as the one 
> attached), MailContentHandler's handleInlineBodyPart() method correctly 
> identifies the body part that should be emitted as the principal content of 
> the mail item, but then uses 
> EmbeddedDocumentUtil.tryToFindExistingLeafParser() to find a parser for that 
> part. If no existing leaf parser is found, it simply gives up and treats the 
> given part as an attachment.
>
> IMHO, the correct behaviour would be to create the necessary parser if none 
> is found, insert it into the parsing context, and use it to extract the 
> content of the selected body part.
>
> In the meantime, I'm working around the issue by creating and registering a 
> custom EmbeddedDocumentExtractor that guesses whether it's been called by the 
> RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, 
> it looks at the Content-Type of the passed-in metadata, and if it's plain 
> text or HTML, it creates a new TXTParser or HtmlParser and a new context, 
> and has them parse into the passed-in ContentHandler. It works, but it's 
> pretty hacky. It'd be far better to have the change in behaviour suggested 
> above.
> [^test.eml]
> ^I've attached the email inline because using the attachment field yields an 
> error: "JIRA could not attach the file as there was a missing token. Please 
> try attaching the file again." I tried twice with the same error returned.^



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
