Thierry Guérin created TIKA-3687:
------------------------------------

             Summary: Email file detected as text/html
                 Key: TIKA-3687
                 URL: https://issues.apache.org/jira/browse/TIKA-3687
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.3.0
            Reporter: Thierry Guérin
         Attachments: testRFC822-ARC.eml

The attached email (which I redacted from a real email received from Office365) 
is detected as HTML.

This is because it contains ARC-* headers, but they are not the first headers, 
so the matcher that looks for ARC- headers fails, and the matcher for the 
regular 'From' header also fails because the 'From' header occurs after the 
first 1024 characters.
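
To make the failure mode easier to follow, here is a minimal Python sketch of 
the two matchers described above. It is a toy model, not Tika's actual 
detection code: the function names detect and build_sample, the 8192-byte HTML 
window, and the synthetic message are all assumptions made for illustration. 
Only the 1024-character limit and the "ARC- must come first" behaviour are 
taken from this report.

```python
HEADER_WINDOW = 1024  # from the report: 'From' is only found within 1024 characters
HTML_WINDOW = 8192    # assumption: the HTML fallback scans a larger window


def detect(data: bytes) -> str:
    """Toy detector mimicking the reported behaviour (hypothetical, not Tika)."""
    head = data[:HEADER_WINDOW]
    # Matcher 1 (as described): an ARC- header only counts if it is the first header.
    if head.startswith(b"ARC-"):
        return "message/rfc822"
    # Matcher 2 (as described): a 'From:' header must start within the window.
    if head.startswith(b"From:") or b"\nFrom:" in head:
        return "message/rfc822"
    # Fallback (assumption): HTML markup further down decides the type.
    if b"<html" in data[:HTML_WINDOW].lower():
        return "text/html"
    return "application/octet-stream"


def build_sample() -> bytes:
    """Synthetic message: ARC- headers are not first, 'From' is past 1024 bytes."""
    lines = [
        "Received: from mail.example.com by mx.example.org; Mon, 1 Jan 2024 00:00:00 +0000",
        "ARC-Seal: i=1; a=rsa-sha256; d=example.com; b=" + "A" * 400,
        "ARC-Message-Signature: i=1; a=rsa-sha256; d=example.com; b=" + "B" * 400,
        "ARC-Authentication-Results: i=1; mx.example.com; " + "x=y; " * 60,
        "From: sender@example.com",
        "Subject: test",
        "Content-Type: text/html",
        "",
        "<html><body>Hello</body></html>",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(detect(build_sample()))
```

In this sketch both header matchers miss the synthetic message, so the HTML 
fallback wins, which matches what I see with the attached file. Moving the 
ARC- header to the top, or shortening the headers so that 'From' falls inside 
the window, makes the toy detector return message/rfc822 again.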



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
