[ https://issues.apache.org/jira/browse/TIKA-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548496#comment-16548496 ]
Yury Kats commented on TIKA-2688:
---------------------------------

To be consistent with TIKA-2578, I would suggest changing the mbox magic to

{noformat}
<magic priority="70">
  <match value="From " type="string" offset="0">
    <match value="\nFrom: " type="string" offset="32:256"/>
    <match value="\nDate: " type="string" offset="32:256"/>
    <match value="\nSubject: " type="string" offset="32:256"/>
    <match value="\nDelivered-To: " type="string" offset="32:256"/>
    <match value="\nReceived: by " type="string" offset="32:256"/>
    <match value="\nReceived: via " type="string" offset="32:256"/>
    <match value="\nReceived: from " type="string" offset="32:256"/>
    <match value="\nMime-Version: " type="string" offset="32:256"/>
    <match value="X-" type="stringignorecase" offset="32:256">
      <match value="\nFrom: " type="string" offset="32:8192"/>
      <match value="\nDate: " type="string" offset="32:8192"/>
      <match value="\nSubject: " type="string" offset="32:8192"/>
      <match value="\nDelivered-To: " type="string" offset="32:8192"/>
      <match value="\nReceived: by " type="string" offset="32:8192"/>
      <match value="\nReceived: via " type="string" offset="32:8192"/>
      <match value="\nReceived: from " type="string" offset="32:8192"/>
      <match value="\nMime-Version: " type="string" offset="32:8192"/>
    </match>
  </match>
</magic>
{noformat}

> MBOX not recognized when unknown X-headers are present
> ------------------------------------------------------
>
>                 Key: TIKA-2688
>                 URL: https://issues.apache.org/jira/browse/TIKA-2688
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Priority: Major
>
> This is a spin off from TIKA-2578
> I have mbox files that are not being recognized as such because they have X-
> headers at the top.
> Current config:
> {noformat}
> <mime-type type="application/mbox">
>   <!-- MBOX files start with "From [sender] [date]" -->
>   <!-- To avoid false matches, check for other headers after that -->
>   <magic priority="70">
>     <match value="From " type="string" offset="0">
>       <match value="\nFrom: " type="string" offset="32:256"/>
>       <match value="\nDate: " type="string" offset="32:256"/>
>       <match value="\nSubject: " type="string" offset="32:256"/>
>       <match value="\nDelivered-To: " type="string" offset="32:256"/>
>       <match value="\nReceived: by " type="string" offset="32:256"/>
>       <match value="\nReceived: via " type="string" offset="32:256"/>
>       <match value="\nReceived: from " type="string" offset="32:256"/>
>       <match value="\nMime-Version: " type="string" offset="32:256"/>
>     </match>
> {noformat}
> mbox file:
> {noformat}
> From "naveen.andr...@enron.com" Wed Jan 30 18:07:01 2002
> X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
> X-EDO-AED-Version: 1.0
> X-EDO-AED-License: Creative Commons Attribution 3.0 United States;
> http://creativecommons.org/licenses/by/3.0/us/;
> To provide attribution, please cite to "EnronData.org."
> X-EDO-AED-ID: 516172
> X-EDO-AED-File: zipper-a/inbox/38.eml
> Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>
> Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
> From: naveen.andr...@enron.com
> To: andy.zip...@enron.com
> Subject: RE: Var simulation
> ...
> {noformat}
> MBOX rule looks for additional headers only in the first 256 bytes, which is
> not enough when X- headers are present.
> Side-note: prior to 1.17 such mbox was detected as text/plain. As of 1.17 it
> is detected as message/rfc822 (due to TIKA-2594 that added a rule for
> Message-ID being present in the first 1000 bytes). Neither is correct!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
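
For illustration only, the effect of the two offset windows on the sample from the description can be modelled in a few lines of Python. This is a rough sketch and not Tika's implementation: `in_window`, `current_rule` and `proposed_rule` are hypothetical names, and the sketch assumes that `offset="a:b"` means the pattern may begin at any offset from a to b inclusive. The sample bytes are copied from the description, with the obfuscated addresses and the unindented continuation lines left exactly as they appear there.

```python
# Illustrative model of the mbox magic rules; NOT Tika code.
# Assumption: offset="a:b" lets the pattern start at any offset in [a, b].

HEADERS = [
    b"\nFrom: ", b"\nDate: ", b"\nSubject: ", b"\nDelivered-To: ",
    b"\nReceived: by ", b"\nReceived: via ", b"\nReceived: from ",
    b"\nMime-Version: ",
]

# Start of the sample mbox file quoted in the issue description.
SAMPLE = b"\n".join([
    b'From "naveen.andr...@enron.com" Wed Jan 30 18:07:01 2002',
    b"X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)",
    b"X-EDO-AED-Version: 1.0",
    b"X-EDO-AED-License: Creative Commons Attribution 3.0 United States;",
    b"http://creativecommons.org/licenses/by/3.0/us/;",
    b'To provide attribution, please cite to "EnronData.org."',
    b"X-EDO-AED-ID: 516172",
    b"X-EDO-AED-File: zipper-a/inbox/38.eml",
    b"Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>",
    b"Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)",
    b"From: naveen.andr...@enron.com",
    b"To: andy.zip...@enron.com",
    b"Subject: RE: Var simulation",
]) + b"\n"


def in_window(data, pattern, start, end, ignore_case=False):
    """True if pattern begins at some offset in [start, end] of data."""
    if ignore_case:
        data, pattern = data.lower(), pattern.lower()
    idx = data.find(pattern, start)  # first occurrence at or after start
    return idx != -1 and idx <= end


def current_rule(data):
    """'From ' at offset 0 plus a known header starting within 32:256."""
    return data.startswith(b"From ") and any(
        in_window(data, h, 32, 256) for h in HEADERS
    )


def proposed_rule(data):
    """Current rule, or an X- header within 32:256 and a known header within 32:8192."""
    if not data.startswith(b"From "):
        return False
    if any(in_window(data, h, 32, 256) for h in HEADERS):
        return True
    return in_window(data, b"X-", 32, 256, ignore_case=True) and any(
        in_window(data, h, 32, 8192) for h in HEADERS
    )


if __name__ == "__main__":
    print("offset of first Date header:", SAMPLE.find(b"\nDate: "))
    print("current rule matches: ", current_rule(SAMPLE))
    print("proposed rule matches:", proposed_rule(SAMPLE))
```

In this model the first known header in the sample starts past offset 256, so the current rule does not match, while the nested X- branch of the proposed rule does.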