[ https://issues.apache.org/jira/browse/TIKA-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085842#comment-16085842 ]
Matthew Caruana Galizia commented on TIKA-2042:
-----------------------------------------------
[~gagravarr] thank you - that fixes the detection of at least one of the MBOX
files. Now the problem is that when the email streams get passed to the
delegate parser by the ParsingEmbeddedDocumentExtractor implementation, they're
detected as text/html instead of message/rfc822.
> MBOX file detected wrongly as text/html
> ---------------------------------------
>
> Key: TIKA-2042
> URL: https://issues.apache.org/jira/browse/TIKA-2042
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Environment: Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the
> time of this writing
> Reporter: Vjeran Marcinko
> Fix For: 1.14
>
> Attachments: clojure.mbox, mbox_email_section.txt, mbox_header.txt
>
>
> An MBOX file isn't recognized by the "magic detection" mechanism as
> "application/mbox", but is wrongly detected as "text/html".
> A workaround for this in Tika 1.13 is to place the following in
> custom-mimetypes.xml, as suggested on the mailing list (the priority has to
> be higher than that of message/rfc822):
> <mime-type type="application/mbox">
>   <magic priority="70">
>     <match value="From " type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.mbox"/>
> </mime-type>
> Sample MBOX file is attached.
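
To illustrate why the priority in the workaround matters, here is a minimal,
self-contained sketch of priority-based magic matching. It is a toy model, not
Tika's actual detector: the class MagicPriorityDemo, its rule sets, the sample
bytes, and the priorities and offsets given to text/html and message/rfc822 are
all assumptions made for illustration. Only the application/mbox rule (value
"From ", offset 0, priority 70) is taken from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MagicPriorityDemo {

    /** Toy magic rule: `value` must start at some offset in [minOffset, maxOffset]. */
    static final class MagicRule {
        final String mimeType;
        final int priority;
        final byte[] value;
        final int minOffset;
        final int maxOffset;

        MagicRule(String mimeType, int priority, String value, int minOffset, int maxOffset) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.value = value.getBytes(StandardCharsets.US_ASCII);
            this.minOffset = minOffset;
            this.maxOffset = maxOffset;
        }

        boolean matches(byte[] data) {
            for (int off = minOffset; off <= maxOffset && off + value.length <= data.length; off++) {
                boolean same = true;
                for (int i = 0; i < value.length; i++) {
                    if (data[off + i] != value[i]) {
                        same = false;
                        break;
                    }
                }
                if (same) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Rules without the workaround; priorities and offsets here are assumed values. */
    static List<MagicRule> defaultRules() {
        List<MagicRule> rules = new ArrayList<>();
        rules.add(new MagicRule("text/html", 60, "<html", 0, 8192));
        rules.add(new MagicRule("message/rfc822", 50, "Received:", 0, 0));
        return rules;
    }

    /** Default rules plus the application/mbox rule from the workaround. */
    static List<MagicRule> withWorkaround() {
        List<MagicRule> rules = defaultRules();
        rules.add(new MagicRule("application/mbox", 70, "From ", 0, 0));
        return rules;
    }

    /** Returns the type of the highest-priority matching rule, or a generic fallback. */
    static String detect(byte[] data, List<MagicRule> rules) {
        MagicRule best = null;
        for (MagicRule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.mimeType;
    }

    public static void main(String[] args) {
        // Hypothetical mbox content whose message body happens to contain HTML.
        byte[] mbox = ("From user@example.com Thu Jul 13 10:00:00 2017\n"
                + "Subject: hi\n\n<html><body>hello</body></html>\n")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(mbox, defaultRules()));
        System.out.println(detect(mbox, withWorkaround()));
    }
}
```

In this toy model the HTML rule is the only match until the "From " rule is
added, and the "From " rule then wins only because its priority is the highest
among the matching rules.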