[
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501957#comment-17501957
]
ASF GitHub Bot commented on TIKA-3687:
--------------------------------------
lfcnassif commented on pull request #520:
URL: https://github.com/apache/tika/pull/520#issuecomment-1059962861
Also, if we look for additional headers when X|DKIM|ARC matches at the
beginning, why not look for other headers when they match in a wide range of
positions? (OK, there is the \n, but it is pretty common in txt files.)
Maybe we could use a regex to put all definitions together, regardless of
whether they are at the beginning of the file or at the beginning of lines in
the first 1024...
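
A minimal sketch of that regex idea in Java, for discussion only. It is not
Tika's actual magic definition: the class name, method name, and the list of
header names are assumptions made up for illustration, and only the 1024 limit
comes from the issue. A single multiline pattern covers both cases, since `^`
matches at the start of the input and after every line terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class HeaderSniffer {

    // Only the first 1024 bytes are inspected, as in the issue description.
    private static final int SNIFF_LIMIT = 1024;

    // Illustrative header list (an assumption, not Tika's definitions).
    // With MULTILINE, '^' matches at offset 0 and at the start of each line.
    private static final Pattern HEADER_AT_LINE_START = Pattern.compile(
            "^(?:X-[\\w-]+|DKIM-[\\w-]+|ARC-[\\w-]+"
                    + "|From|To|Subject|Date|Received|Message-ID"
                    + "|Return-Path|MIME-Version)[ \\t]*:",
            Pattern.MULTILINE);

    static boolean looksLikeRfc822(byte[] data) {
        int len = Math.min(data.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so byte offsets
        // and char offsets stay aligned within the sniffed prefix.
        String prefix = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        return HEADER_AT_LINE_START.matcher(prefix).find();
    }
}
```

As noted above, a line-anchored match like this is also common in plain txt
files, so a real definition would probably need to require more than one
matching header before deciding on message/rfc822.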
> Email file detected as text/html
> --------------------------------
>
> Key: TIKA-3687
> URL: https://issues.apache.org/jira/browse/TIKA-3687
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Thierry Guérin
> Priority: Minor
> Fix For: 2.3.1
>
> Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from
> Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they're not the first ones, so
> the matcher that looks for ARC headers fails, and the matcher for the regular
> 'From' header also fails because the 'From' header occurs after 1024
> characters.