[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500974#comment-17500974
 ] 

ASF GitHub Bot commented on TIKA-3687:
--------------------------------------

SchwingSK commented on a change in pull request #520:
URL: https://github.com/apache/tika/pull/520#discussion_r818959865



##########
File path: tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
##########
@@ -6422,7 +6422,7 @@
       <!-- match X- DKIM- ARC- at start of file and then require at least one
            of the usual: from, received, date...but look farther into the file
            because of the X|DKIM|ARC headers-->
-      <match value="(X|DKIM|ARC)-" type="regex" offset="0">
+      <match value="(X|DKIM|ARC)-" type="regex" offset="0:1024">

Review comment:
       Good point, I like your solution better. Code changed accordingly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> Email file detected as text/html
> --------------------------------
>
>                 Key: TIKA-3687
>                 URL: https://issues.apache.org/jira/browse/TIKA-3687
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they are not the first 
> headers, so the matcher that looks for ARC headers at offset 0 fails, and 
> the matcher for the regular 'From' header also fails because the 'From' 
> header occurs after the first 1024 characters.
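
The difference between the two offset values in the diff can be illustrated with a small standalone sketch. It does not use Tika's detector. It only uses java.util.regex and assumes that offset="0" means the pattern must match starting exactly at position 0, while offset="0:1024" means the match may begin anywhere from position 0 to 1024. The class name, method names, and sample headers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetRangeSketch {
    // Same pattern as the <match> value in the diff above.
    static final Pattern PREFIX = Pattern.compile("(X|DKIM|ARC)-");

    // Assumed model of offset="0": the match must start at position 0.
    static boolean matchesAtOffsetZero(String head) {
        return PREFIX.matcher(head).lookingAt();
    }

    // Assumed model of offset="0:1024": the first match may start at any
    // position from 0 up to maxStart (inclusive).
    static boolean matchesInRange(String head, int maxStart) {
        Matcher m = PREFIX.matcher(head);
        return m.find() && m.start() <= maxStart;
    }

    public static void main(String[] args) {
        // Hypothetical header block: the ARC header is present but not first.
        String head = "Received: from mail.example.org\r\n"
                + "ARC-Seal: i=1; a=rsa-sha256\r\n"
                + "From: sender@example.org\r\n";

        System.out.println("offset 0:      " + matchesAtOffsetZero(head));
        System.out.println("offset 0:1024: " + matchesInRange(head, 1024));
    }
}
```

Under these assumptions the anchored check misses the ARC header because a Received header comes first, while the ranged check finds it. This mirrors why the reported email fell through to text/html.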



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
