[ https://issues.apache.org/jira/browse/TIKA-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148117#comment-16148117 ]
Matthew Caruana Galizia commented on TIKA-2454:
-----------------------------------------------

I don't know if the same thing can be done wholesale for mbox files. There are four variants of the mbox format: http://www.forensicswiki.org/wiki/MBox#MBOX_File_Variants

> Emails extracted from PSTs detected as unexpected file types
> ------------------------------------------------------------
>
>                 Key: TIKA-2454
>                 URL: https://issues.apache.org/jira/browse/TIKA-2454
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>             Fix For: 1.17
>
>
> This issue is severe. The Outlook PST parser extracts a string for the body
> of every email and passes that string to the {{EmbeddedDocumentExtractor}}.
> However, no content type is set on the {{Metadata}} object passed to the
> extractor. Therefore if, for example, the body of the email starts with the
> string "From John Smith." (as can happen when an email was forwarded), then
> the body of the email is detected as {{application/mbox}} and parsed as
> though it were an mbox file.
> I think the immediate fix for this issue is to force the type of the email to
> {{text/plain}} and for it to be parsed as such.
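
The failure mode described in the issue can be illustrated with a small, self-contained sketch. It does not use Tika's API: `detectByMagic` and `resolveType` are hypothetical names, the "From " prefix check is a simplified stand-in for magic-based mbox detection, and the rule that a declared type takes precedence over detection is an assumption made for illustration, which may not match how Tika actually resolves types.

```java
import java.nio.charset.StandardCharsets;

class BodyTypeSketch {

    // Hypothetical stand-in for magic-based detection: content that begins
    // with "From " is treated as mbox, everything else as plain text.
    static String detectByMagic(byte[] content) {
        byte[] magic = "From ".getBytes(StandardCharsets.US_ASCII);
        if (content.length < magic.length) {
            return "text/plain";
        }
        for (int i = 0; i < magic.length; i++) {
            if (content[i] != magic[i]) {
                return "text/plain";
            }
        }
        return "application/mbox";
    }

    // Assumption for illustration: a declared type, when present, wins
    // over detection. A null declared type models the missing content type.
    static String resolveType(String declaredType, byte[] content) {
        return declaredType != null ? declaredType : detectByMagic(content);
    }

    public static void main(String[] args) {
        byte[] body = "From John Smith. Forwarded message follows."
                .getBytes(StandardCharsets.US_ASCII);

        // No declared type: the body is mistaken for an mbox file.
        System.out.println(resolveType(null, body));

        // Declared type forced to text/plain: the body is treated as text.
        System.out.println(resolveType("text/plain", body));
    }
}
```

In this sketch the first call falls through to detection because no type was declared, which mirrors the missing content type on the metadata; the second call shows the effect of forcing the type before the body is handed on.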