Thierry Guérin created TIKA-3687:
------------------------------------

             Summary: Email file detected as text/html
                 Key: TIKA-3687
                 URL: https://issues.apache.org/jira/browse/TIKA-3687
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.3.0
            Reporter: Thierry Guérin
         Attachments: testRFC822-ARC.eml

The attached email (which I redacted from a real email received from Office365) is detected as HTML. This is because it contains ARC-* headers, but they are not the first headers in the file, so the matcher that looks for ARC- headers fails. The matcher for the regular 'From' header also fails, because the 'From' header occurs after the first 1024 characters.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
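
To make the failure mode described in the report concrete, here is a minimal, self-contained sketch. It is not Tika's detection code and does not reproduce the rules in Tika's MIME type definitions. The class `HeaderSniffSketch`, its methods, the synthetic sample message, and the relaxed rule `arcAtLineStartWithinWindow` are all hypothetical names introduced for illustration. The only details taken from the report are that the ARC matcher expects an ARC- header first and that the 'From' header lies beyond the first 1024 characters.

```java
import java.util.Arrays;

// Illustrative sketch only: NOT Tika's actual matchers or rules.
final class HeaderSniffSketch {
    // Window size taken from the report ("after 1024 characters").
    static final int WINDOW = 1024;

    // Matcher anchored at offset 0: succeeds only if the data begins with "ARC-".
    static boolean startsWithArc(String msg) {
        return msg.startsWith("ARC-");
    }

    // Matcher that looks for a From header at a line start within the window.
    static boolean fromWithinWindow(String msg) {
        String head = msg.substring(0, Math.min(msg.length(), WINDOW));
        return head.startsWith("From:") || head.contains("\nFrom:");
    }

    // Hypothetical relaxed rule: accept an ARC- header at any line start
    // within the window, not only at offset 0.
    static boolean arcAtLineStartWithinWindow(String msg) {
        String head = msg.substring(0, Math.min(msg.length(), WINDOW));
        return head.startsWith("ARC-") || head.contains("\nARC-");
    }

    // Synthetic message: another header first, then a long ARC-Seal header
    // (padding is artificial), so From: starts beyond the first 1024 characters.
    static String sampleMessage() {
        char[] pad = new char[1100];
        Arrays.fill(pad, 'x');
        return "X-Example-Header: value\r\n"
                + "ARC-Seal: i=1; a=rsa-sha256; b=" + new String(pad) + "\r\n"
                + "From: sender@example.com\r\n"
                + "\r\n"
                + "<html><body>Hello</body></html>\r\n";
    }

    public static void main(String[] args) {
        String msg = sampleMessage();
        System.out.println("ARC- at offset 0:          " + startsWithArc(msg));
        System.out.println("From: within window:       " + fromWithinWindow(msg));
        System.out.println("ARC- at line start (idea): " + arcAtLineStartWithinWindow(msg));
    }
}
```

With this synthetic input, both of the first two checks fail, which would leave only the HTML body to match on, while the relaxed line-start check succeeds. Whether such a relaxation is the right fix for Tika is a separate question that the sketch does not settle.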