[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501959#comment-17501959
 ] 

ASF GitHub Bot commented on TIKA-3687:
--------------------------------------

lfcnassif edited a comment on pull request #520:
URL: https://github.com/apache/tika/pull/520#issuecomment-1059962861


   Also, if we already look for additional headers when X|DKIM|ARC matches at 
the beginning, why not look for other headers when they match across a wide 
range of positions? (OK, there is the \n requirement, but \n is pretty common 
in txt files.)
   
   Maybe we could use a regex to put all the definitions together, regardless 
of whether they match at the beginning of the file or at the beginning of a 
line within the first 1024...
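
   A minimal sketch of that idea, written in Python purely for illustration 
(it is not Tika's actual detection code or its magic definitions). The header 
lists, the function names, the 1024-byte window parameter and the 
`min_headers` threshold are all assumptions of this sketch. Requiring more 
than one header line is one possible way to offset how common \n is in plain 
text files.

```python
import re

# Hypothetical header lists for illustration only; not Tika's real definitions.
HEADER_NAMES = ("From", "To", "Subject", "Received", "Message-ID",
                "Return-Path", "MIME-Version", "Date")
HEADER_PREFIXES = ("X-", "DKIM-", "ARC-")

_names = "|".join(re.escape(n) for n in HEADER_NAMES)
_prefixes = "|".join(re.escape(p) for p in HEADER_PREFIXES)

# With re.MULTILINE, ^ matches at the start of the data and right after
# every \n, so one pattern covers both "beginning of file" and
# "beginning of a line".
HEADER_RE = re.compile(
    rf"^(?:(?:{_names})|(?:{_prefixes})[A-Za-z0-9-]+):".encode("ascii"),
    re.MULTILINE,
)


def count_header_lines(data: bytes, window: int = 1024) -> int:
    """Count lines in the first `window` bytes that start with a known header."""
    return len(HEADER_RE.findall(data[:window]))


def looks_like_email(data: bytes, window: int = 1024, min_headers: int = 2) -> bool:
    """Heuristic: treat the data as RFC 822 if enough header lines are found."""
    return count_header_lines(data, window) >= min_headers
```

   With something like this, a message whose 'From' header falls outside the 
window could still be recognised from the 'Received' and 'ARC-*' lines that 
precede it, while an HTML page that merely contains "From:" in the middle of a 
line would not match.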




> Email file detected as text/html
> --------------------------------
>
>                 Key: TIKA-3687
>                 URL: https://issues.apache.org/jira/browse/TIKA-3687
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>             Fix For: 2.3.1
>
>         Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they are not the first 
> headers, so the matcher that looks for ARC headers fails, and the matcher 
> for the regular 'From' header also fails because the 'From' header occurs 
> after the first 1024 characters.



