[ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thierry Guérin updated TIKA-3687:
---------------------------------
Description:
The attached email (which I redacted from a real email received from Office365)
is detected as HTML.
This is because it contains ARC * headers, but they're not the first ones, so
the matcher that looks for ARC headers fails, and the matcher for the regular
'From' header also fails because the 'From' header occurs after 1024
characters.
was:
The attached email (which I redacted from a real email received from Office365)
is detected as HTML.
This is because it contains ARC -* headers, but they're not the first ones, so
the matcher that looks for ARC- headers fails, and the matcher for the regular
'From' header also fails because the 'From' header occurs after 1024
characters.
> Email file detected as text/html
> --------------------------------
>
> Key: TIKA-3687
> URL: https://issues.apache.org/jira/browse/TIKA-3687
> Project: Tika
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Thierry Guérin
> Priority: Minor
> Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from
> Office365) is detected as HTML.
> This is because it contains ARC * headers, but they're not the first ones, so
> the matcher that looks for ARC headers fails, and the matcher for the regular
> 'From' header also fails because the 'From' header occurs after 1024
> characters.
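
To make the failure mode described above concrete, here is a minimal,
self-contained sketch of offset-bounded header matching. It is not Tika's
actual detection code: the class BoundedHeaderMatch, its methods, and the
sample message are hypothetical. The only details taken from the report are
that the ARC match is anchored at the start of the file and that the 'From'
match is limited to the first 1024 characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is NOT Tika's implementation.
final class BoundedHeaderMatch {

    /** True if needle occurs in data at some start offset in [0, maxOffset]. */
    static boolean matchesWithin(byte[] data, byte[] needle, int maxOffset) {
        int lastStart = Math.min(maxOffset, data.length - needle.length);
        for (int start = 0; start <= lastStart; start++) {
            int i = 0;
            while (i < needle.length && data[start + i] == needle[i]) {
                i++;
            }
            if (i == needle.length) {
                return true;
            }
        }
        return false;
    }

    /** True if data begins with prefix, i.e. a match anchored at offset 0. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        return matchesWithin(data, prefix, 0);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /**
     * Builds a made-up message shaped like the one in the report: many
     * 'Received' headers first, then an ARC header, then 'From', so that
     * 'From' starts well past the first 1024 bytes.
     */
    static byte[] sampleMessage() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            sb.append("Received: from relay.example.com by mx.example.com")
              .append(" with ESMTPS id ").append(i).append("\r\n");
        }
        sb.append("ARC-Seal: i=1; a=rsa-sha256\r\n");
        sb.append("From: sender@example.com\r\n");
        sb.append("Subject: test\r\n\r\n<html><body>hello</body></html>\r\n");
        return ascii(sb.toString());
    }

    public static void main(String[] args) {
        byte[] msg = sampleMessage();
        // Anchored ARC match fails: the file starts with 'Received', not 'ARC-'.
        System.out.println("ARC at offset 0:      " + startsWith(msg, ascii("ARC-")));
        // Bounded 'From' match fails: the header starts after byte 1024.
        System.out.println("From within 1024:     " + matchesWithin(msg, ascii("\nFrom:"), 1024));
        // Both headers are present once the offset limits are lifted.
        System.out.println("ARC anywhere:         " + matchesWithin(msg, ascii("\nARC-"), msg.length));
        System.out.println("From anywhere:        " + matchesWithin(msg, ascii("\nFrom:"), msg.length));
    }
}
```

With both bounded matches failing, nothing identifies the file as an email, so
a later match on the HTML body can win. Whether the right fix is a wider
offset window, additional leading header names, or something else is left open
here.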