Joshua Turner created TIKA-2921:
-----------------------------------

             Summary: Tika discarding bodies of inline MIME elements in RFC822 email
                 Key: TIKA-2921
                 URL: https://issues.apache.org/jira/browse/TIKA-2921
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.22
         Environment: Reproducible on Java 8 and 11 on both Linux and Win 10.
            Reporter: Joshua Turner


Given an RFC822 email that has two inline body parts (such as the one 
attached), MailContentHandler's handleInlineBodyPart() method correctly 
identifies the body part that should be emitted as the principal content of the 
mail item, but then uses EmbeddedDocumentUtil.tryToFindExistingLeafParser() to 
find a parser for that part. If no existing leaf parser is found, it simply 
gives up and treats the given part as an attachment.

IMHO, the correct behaviour would be to create the necessary parser if none is 
found, insert it into the parsing context, and use it to extract the content of 
the selected body part.

In the meantime, I'm working around the issue by creating and registering a 
custom EmbeddedDocumentExtractor that guesses whether it's been called by the 
RFC822Parser by looking at the "X-Parsed-By" metadata value. When triggered, it 
looks at the Content-Type of the passed-in metadata, and if it's plain text or 
HTML, it creates a new TXTParser or HtmlParser and a new context, and has that 
parser parse into the passed-in ContentHandler. It works, but it's pretty 
hacky. It'd be far better to have the change in behaviour suggested above.
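
For illustration only, here is a rough sketch of the routing decision such a 
workaround has to make. It is written against plain Java rather than the Tika 
API so that it runs standalone. The class InlinePartRouter and the Route enum 
are hypothetical names, not part of Tika or of the actual workaround code. The 
sketch also assumes that the mail parser's fully qualified class name appears 
among the "X-Parsed-By" values, and that text/plain and text/html are the two 
content types of interest.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the routing decision; not part of Tika. */
final class InlinePartRouter {

    /** Which kind of parser the extractor would create for the part. */
    enum Route { PLAIN_TEXT, HTML, DEFAULT }

    // Assumption: the mail parser's class name shows up in X-Parsed-By.
    private static final String MAIL_PARSER =
            "org.apache.tika.parser.mail.RFC822Parser";

    /**
     * @param parsedBy    all values of the "X-Parsed-By" metadata field
     * @param contentType the Content-Type of the embedded part, may be null
     */
    static Route choose(List<String> parsedBy, String contentType) {
        // Only intervene when the caller looks like the RFC822 parser.
        if (parsedBy == null || !parsedBy.contains(MAIL_PARSER)
                || contentType == null) {
            return Route.DEFAULT;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "text/plain":
                return Route.PLAIN_TEXT;
            case "text/html":
                return Route.HTML;
            default:
                // Anything else falls through to the normal attachment handling.
                return Route.DEFAULT;
        }
    }
}
```

In a real extractor, PLAIN_TEXT and HTML would correspond to creating a text 
or HTML parser with a fresh context, and DEFAULT to delegating to whatever 
extractor was registered before.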

[^test.eml]

^I've attached the email inline because using the attachment field yields an 
error: "JIRA could not attach the file as there was a missing token. Please try 
attaching the file again." I tried twice with the same error returned.^



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
