[ 
https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129
 ] 

Tim Allison commented on TIKA-1162:
-----------------------------------

Would you be willing to attach a document/test case that triggers this issue?
                
> content-type/charset problem with RFC822Parser
> ----------------------------------------------
>
>                 Key: TIKA-1162
>                 URL: https://issues.apache.org/jira/browse/TIKA-1162
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Maciej Lizewski
>
> RFC822Parser (MIME mail) uses MailContentHandler, which internally uses 
> AutoDetectParser to handle each MIME part. MailContentHandler reads the 
> MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata 
> correctly, and passes this metadata to the AutoDetectParser::parse 
> method. But that method ignores those values and overwrites the content 
> type:
>         MediaType type = this.getDetector().detect(tis, metadata);
>         metadata.set(Metadata.CONTENT_TYPE, type.toString());
> This leads to additional recursion (the Detector returns message/rfc822 
> instead of the proper MIME type for the current MIME part), and the 
> parser eventually exits the loop, but without the proper content-type 
> and content-encoding headers.
> My proposal is to check in AutoDetectParser::parse whether the metadata 
> already contains CONTENT_TYPE and, if so, not override it. If that is not 
> valid behavior in general, then RFC822Parser should use a custom parser in 
> MailContentHandler that respects the passed content type.
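
For illustration only, the check proposed in the description might look roughly like the sketch below. It is not Tika code: a plain Map stands in for Tika's Metadata, a Supplier stands in for the Detector call, and the names ContentTypeGuard and resolveContentType are hypothetical. The only thing taken from the issue is the idea of keeping an already-declared content type instead of overwriting it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the guard proposed in the issue; not part of Tika.
final class ContentTypeGuard {

    // Header name used as the metadata key in this sketch.
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the content type that should be used for the current part.
    // A non-blank value already present in the metadata is kept as-is;
    // the detector is consulted (and its result stored) only otherwise.
    static String resolveContentType(Map<String, String> metadata,
                                     Supplier<String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared; // respect what the caller (e.g. a MIME part header) set
        }
        String detected = detector.get();
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        // Case 1: the MIME part header already declared a type and charset.
        Map<String, String> withHeader = new HashMap<>();
        withHeader.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        System.out.println(resolveContentType(withHeader, () -> "message/rfc822"));

        // Case 2: nothing declared, so the detector result is used.
        Map<String, String> withoutHeader = new HashMap<>();
        System.out.println(resolveContentType(withoutHeader, () -> "application/pdf"));
    }
}
```

Whether skipping detection is acceptable in general is exactly the open question raised in the description; a declared type can be wrong, so a real fix might restrict this behavior to the mail parser.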

