[
https://issues.apache.org/jira/browse/TIKA-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372204#comment-14372204
]
Luis Filipe Nassif edited comment on TIKA-1267 at 3/20/15 10:10 PM:
--------------------------------------------------------------------
Detection by extension alone is unreliable because many mail applications store
mbox files without any extension. Maybe we can make application/mbox a subclass
of message/rfc822 (not semantically true) after widening the rfc822 magic
offsets. Does the default detector check for parent magics?
Or maybe include some extended rfc822 magics as a prerequisite, because they
should be present in the first email:
{code}
<mime-type type="application/mbox">
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="Forward\ to" type="string" offset="0:1024"/>
      <match value="Return-Path:" type="stringignorecase" offset="0:1024"/>
      <match value="Received:" type="stringignorecase" offset="0:1024"/>
      <match value="Message-ID:" type="stringignorecase" offset="0:1024"/>
    </match>
  </magic>
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
{code}
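To make the intent of the nested rule concrete, here is a minimal Python sketch of the check it describes: the file must start with "From " and at least one of the nested header patterns must begin within the first 1024 bytes. This is not Tika's matcher. The function names (looks_like_mbox, starts_within) are hypothetical, the offset range is assumed to be inclusive, and "Forward\ to" is assumed to mean the literal text "Forward to".

```python
# Sketch only: mirrors the intent of the proposed magic rule, not Tika's code.
MAX_OFFSET = 1024  # upper bound of the "0:1024" offset range in the rule

# (pattern, ignore_case) pairs taken from the nested <match> elements.
# Assumption: "Forward\ to" in the rule means the literal text "Forward to".
NESTED_PATTERNS = [
    (b"Forward to", False),    # type="string": case-sensitive
    (b"Return-Path:", True),   # type="stringignorecase"
    (b"Received:", True),
    (b"Message-ID:", True),
]


def starts_within(data: bytes, pattern: bytes, ignore_case: bool) -> bool:
    """Return True if pattern begins at some offset in 0..MAX_OFFSET."""
    # A match starting at MAX_OFFSET still needs len(pattern) more bytes.
    window = data[: MAX_OFFSET + len(pattern)]
    if ignore_case:
        window, pattern = window.lower(), pattern.lower()
    return pattern in window


def looks_like_mbox(data: bytes) -> bool:
    """Outer match ("From " at offset 0) AND at least one nested match."""
    if not data.startswith(b"From "):
        return False
    return any(starts_within(data, p, ic) for p, ic in NESTED_PATTERNS)


if __name__ == "__main__":
    mbox = (b"From alice@example.org Fri Mar 20 22:10:00 2015\n"
            b"Return-Path: <alice@example.org>\n\nbody\n")
    note = b"From here on, this is just a plain text note.\n"
    print(looks_like_mbox(mbox), looks_like_mbox(note))
```

The second sample shows the case the nested patterns are meant to filter out: a plain text file that merely begins with the word "From " passes the outer match but fails the header prerequisite.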
> Improve Mbox file detection
> ---------------------------
>
> Key: TIKA-1267
> URL: https://issues.apache.org/jira/browse/TIKA-1267
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.5
> Reporter: Luis Filipe Nassif
> Priority: Minor
>
> Could we add the code below to the application/mbox mime-type definition?
> {code}
> <magic priority="70">
>   <match value="From " type="string" offset="0"/>
> </magic>
> {code}
> Or is it too common out there?