[ 
https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962
 ] 

Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM:
---------------------------------------------------------------

Created a pull request: [https://github.com/apache/tika/pull/520]

I went with changing the look-ahead for the X|DKIM|ARC headers from 0 to 1024. 
The other option was to increase the 1024 limit to at least 8000 (I have another 
email in which the first 'From:' is at around offset 6400) in lines 6407-6420. 
I'm sure someone here has a good idea of which version is more efficient.

So far, I have only found examples with a single 'Received:' header before the 
'ARC*' headers, which is why I think 1024 may be overkill.
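
To illustrate the two options, here is a minimal sketch of an offset-bounded 
marker search. It is not Tika's actual magic-matching code: the class 
HeaderLookAhead, the method matchesWithin, and the sample message are all 
hypothetical and only mimic the idea of a look-ahead window.

```java
import java.nio.charset.StandardCharsets;

public class HeaderLookAhead {

    /**
     * Returns true if marker starts at some offset in [minOffset, maxOffset].
     * Hypothetical helper: it only imitates an offset-range magic match.
     */
    public static boolean matchesWithin(byte[] data, String marker,
                                        int minOffset, int maxOffset) {
        byte[] needle = marker.getBytes(StandardCharsets.US_ASCII);
        // Never read past the end of the data.
        int last = Math.min(maxOffset, data.length - needle.length);
        for (int start = minOffset; start <= last; start++) {
            boolean match = true;
            for (int i = 0; i < needle.length; i++) {
                if (data[start + i] != needle[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    /** Made-up message: one Received header, a long ARC header, then From. */
    public static byte[] sampleMessage() {
        StringBuilder sb = new StringBuilder();
        sb.append("Received: from mail.example.org by mx.example.org;\r\n");
        sb.append("ARC-Seal: i=1; a=rsa-sha256; d=example.org; b=");
        for (int i = 0; i < 2000; i++) {
            sb.append('A'); // padding that pushes From past offset 1024
        }
        sb.append("\r\n");
        sb.append("From: sender@example.org\r\n");
        sb.append("Subject: test\r\n\r\nbody\r\n");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] msg = sampleMessage();
        // ARC marker only at offset 0: misses, Received comes first.
        System.out.println(matchesWithin(msg, "ARC-", 0, 0));
        // ARC marker anywhere in the first 1024 bytes: found.
        System.out.println(matchesWithin(msg, "ARC-", 0, 1024));
        // From marker in the first 1024 bytes: misses, it is too far in.
        System.out.println(matchesWithin(msg, "\nFrom:", 0, 1024));
        // From marker with a larger window: found.
        System.out.println(matchesWithin(msg, "\nFrom:", 0, 8000));
    }
}
```

Either change makes the sample match. Widening the ARC window keeps the scan 
short whenever an ARC header appears early, while widening the 'From' window 
means scanning up to 8000 bytes of any input that has no earlier match.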



> Email file detected as text/html
> --------------------------------
>
>                 Key: TIKA-3687
>                 URL: https://issues.apache.org/jira/browse/TIKA-3687
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from 
> Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they are not the first headers, 
> so the matcher that looks for ARC headers fails, and the matcher for the regular 
> 'From' header also fails because the 'From' header occurs after the first 1024 
> characters.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
