[ 
https://issues.apache.org/jira/browse/TIKA-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556396#comment-16556396
 ] 

Hudson commented on TIKA-2688:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1526 (See [https://builds.apache.org/job/Tika-trunk/1526/])
TIKA-2688 via Yury Kats (tallison: [https://github.com/apache/tika/commit/aac3af4ccb9da07b1ade1a57ed5e015dbedb17c4])
* (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* (add) tika-parsers/src/test/resources/test-documents/testMBOX_lengthy_x-headers.mbox


> MBOX not recognized when unknown X-headers are present
> ------------------------------------------------------
>
>                 Key: TIKA-2688
>                 URL: https://issues.apache.org/jira/browse/TIKA-2688
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>
> This is a spin-off from TIKA-2578.
> I have mbox files that are not being recognized as such because they have X- 
> headers at the top.
> Current config:
> {noformat}
>   <mime-type type="application/mbox">
>     <!-- MBOX files start with "From [sender] [date]" -->
>     <!-- To avoid false matches, check for other headers after that -->
>     <magic priority="70">
>       <match value="From " type="string" offset="0">
>          <match value="\nFrom: " type="string" offset="32:256"/>
>          <match value="\nDate: " type="string" offset="32:256"/>
>          <match value="\nSubject: " type="string" offset="32:256"/>
>          <match value="\nDelivered-To: " type="string" offset="32:256"/>
>          <match value="\nReceived: by " type="string" offset="32:256"/>
>          <match value="\nReceived: via " type="string" offset="32:256"/>
>          <match value="\nReceived: from " type="string" offset="32:256"/>
>          <match value="\nMime-Version: " type="string" offset="32:256"/>
>       </match>
> {noformat}
> mbox file:
> {noformat}
> From "[email protected]" Wed Jan 30 18:07:01 2002
> X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
> X-EDO-AED-Version: 1.0
> X-EDO-AED-License: Creative Commons Attribution 3.0 United States;
>  http://creativecommons.org/licenses/by/3.0/us/;
>  To provide attribution, please cite to "EnronData.org."
> X-EDO-AED-ID: 516172
> X-EDO-AED-File: zipper-a/inbox/38.eml
> Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>
> Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
> From: [email protected]
> To: [email protected]
> Subject: RE: Var simulation
> ...
> {noformat}
> The MBOX rule looks for the additional headers only at offsets 32 to 256, which is 
> not enough when X- headers are present.
> Side note: prior to 1.17, such an mbox file was detected as text/plain. As of 1.17, it 
> is detected as message/rfc822 (due to TIKA-2594, which added a rule for 
> Message-ID being present in the first 1000 bytes). Neither is correct!
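
The offset-window problem described in the report can be reproduced with a small plain-Java sketch. This is not Tika's detector code: the class and method names are hypothetical, the sender address is a placeholder, the inclusive upper bound of the window is an assumption, and the wider window of 1024 is only an illustrative value, not necessarily what the committed fix uses.

```java
import java.util.Arrays;

// Illustrative sketch only: mimics the quoted magic rule in plain Java.
// It is NOT Tika's detector code; all names here are hypothetical.
class MboxWindowSketch {

    // Header markers copied from the quoted tika-mimetypes.xml excerpt.
    static final String[] HEADERS = {
        "\nFrom: ", "\nDate: ", "\nSubject: ", "\nDelivered-To: ",
        "\nReceived: by ", "\nReceived: via ", "\nReceived: from ", "\nMime-Version: "
    };

    // Sample modelled on the mbox file in the report; the sender address is a placeholder.
    static final String SAMPLE =
        "From \"sender@example.com\" Wed Jan 30 18:07:01 2002\n"
        + "X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)\n"
        + "X-EDO-AED-Version: 1.0\n"
        + "X-EDO-AED-License: Creative Commons Attribution 3.0 United States;\n"
        + " http://creativecommons.org/licenses/by/3.0/us/;\n"
        + " To provide attribution, please cite to \"EnronData.org.\"\n"
        + "X-EDO-AED-ID: 516172\n"
        + "X-EDO-AED-File: zipper-a/inbox/38.eml\n"
        + "Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>\n"
        + "Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)\n"
        + "Subject: RE: Var simulation\n";

    // True if needle starts at an offset in [from, to].
    // Treating the upper bound as inclusive is an assumption of this sketch.
    static boolean startsWithin(String data, String needle, int from, int to) {
        int idx = data.indexOf(needle, from);
        return idx >= 0 && idx <= to;
    }

    // Outer match: "From " at offset 0. Inner match: any listed header in the window.
    static boolean looksLikeMbox(String data, int windowEnd) {
        if (!data.startsWith("From ")) {
            return false;
        }
        return Arrays.stream(HEADERS)
                     .anyMatch(h -> startsWithin(data, h, 32, windowEnd));
    }

    public static void main(String[] args) {
        System.out.println("window 32:256  -> " + looksLikeMbox(SAMPLE, 256));
        System.out.println("window 32:1024 -> " + looksLikeMbox(SAMPLE, 1024));
    }
}
```

With this sample, the 32:256 check fails and the wider check succeeds, because the X- headers push the first recognized header (Date:) well past offset 256.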



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
