[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540122#comment-17540122
 ] 

ASF GitHub Bot commented on TIKA-3687:
--------------------------------------

lfcnassif commented on PR #520:
URL: https://github.com/apache/tika/pull/520#issuecomment-1132891085

   @SchwingSK I detected some false positives (see TIKA-3687) after this change. The 
<match value="\nX-" type="string" offset="0:1024"/> rule is not needed to 
detect the EML sample you provided in TIKA-3687. I'm not sure this is the best 
option, but if we remove the rule, will the EML files in your dataset still be 
detected?
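
To illustrate the concern, here is a rough Python sketch of how a string rule with an offset window can over-match. It is a simplified model, not Tika's actual implementation: the helper `matches_in_range`, the sample byte strings, and the `\nFrom:` pattern are hypothetical, and it assumes the rule matches when the pattern starts at any offset inside the inclusive range.

```python
def matches_in_range(data: bytes, pattern: bytes, start: int, end: int) -> bool:
    """Return True if `pattern` begins at some offset in [start, end] (inclusive).

    Simplified model of a magic rule such as offset="0:1024"; this is an
    assumption for illustration, not a copy of Tika's matching code.
    """
    # bytes.find only reports matches fully contained in data[start:stop],
    # so extend the stop by len(pattern) to allow a match starting at `end`.
    return data.find(pattern, start, end + len(pattern)) != -1


# Hypothetical samples (not taken from the issue attachment).
eml = b"Received: from mail.example.org\nX-Mailer: demo\nFrom: a@example.org\n\nbody"
not_mail = b"# plot notes\nX-axis label: time\nY-axis label: value\n"
late_from = b"Received: x\n" + b"Y-Pad: " + b"a" * 1100 + b"\nFrom: a@example.org\n"

# A real message with an X- header matches the "\nX-" rule...
print(matches_in_range(eml, b"\nX-", 0, 1024))
# ...but so does any text with a line that happens to start with "X-".
print(matches_in_range(not_mail, b"\nX-", 0, 1024))
# A 'From' header that starts past offset 1024 is missed by a 0:1024 window.
print(matches_in_range(late_from, b"\nFrom:", 0, 1024))
```

The second case is the kind of false positive I mean: nothing about `\nX-` alone is specific to email, so a wide offset window makes an accidental match likely on ordinary text files.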




> Email file detected as text/html
> --------------------------------
>
>                 Key: TIKA-3687
>                 URL: https://issues.apache.org/jira/browse/TIKA-3687
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>             Fix For: 2.4.0
>
>         Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they are not the first headers, 
> so the matcher that looks for ARC headers fails, and the matcher for the 
> regular 'From' header also fails because the 'From' header occurs after the 
> first 1024 characters.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
