[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ]

Tim Allison commented on TIKA-1162:
-----------------------------------

Dear Colleague,

I'm on paternity leave. Will be back part time on October 14.

Best,
Tim

content-type/charset problem with RFC822Parser
----------------------------------------------

Key: TIKA-1162
URL: https://issues.apache.org/jira/browse/TIKA-1162
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Maciej Lizewski

RFC822Parser (MIME mail) uses MailContentHandler, which internally uses AutoDetectParser to handle each MIME part.

The problem is that MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata properly, and passes this metadata to the AutoDetectParser::parse method. But that method ignores those headers and overwrites them:

    MediaType type = this.getDetector().detect(tis, metadata);
    metadata.set(Metadata.CONTENT_TYPE, type.toString());

This leads to some additional recursion loops (the Detector returns the message/rfc822 MIME type instead of the proper MIME type for the current MIME part), and finally it somehow exits the loop, but without the proper content-type and content-encoding headers...

My proposal is to add a check in AutoDetectParser::parse for whether the metadata already contains CONTENT_TYPE and, in that case, not override it. If this is not valid behavior in general, then RFC822Parser should use a custom parser in MailContentHandler which respects the passed content-type...

--
This message was sent by Atlassian JIRA
(v6.1#6144)
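The check proposed in the issue description can be sketched in isolation as follows. This is only an illustration of the idea, not Tika's code and not the fix that was eventually applied: a plain Map stands in for Tika's Metadata object, and the names ContentTypeGuardSketch, TypeDetector and resolveContentType are hypothetical, introduced here for the example. The sample charset value is also made up.

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeGuardSketch {
    // Assumption: a plain string key standing in for Metadata.CONTENT_TYPE.
    public static final String CONTENT_TYPE = "Content-Type";

    /** Hypothetical stand-in for a content-type detector. */
    public interface TypeDetector {
        String detect(Map<String, String> metadata);
    }

    /**
     * Returns the content type to parse with. A type already present in the
     * metadata (for example, one set from the MIME part headers) is kept;
     * detection runs, and its result is stored, only when none was supplied.
     */
    public static String resolveContentType(Map<String, String> metadata,
                                            TypeDetector detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.isEmpty()) {
            return declared; // respect the caller-supplied type
        }
        String detected = detector.detect(metadata);
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        // A detector that always answers message/rfc822, mimicking the
        // misdetection described in the report.
        TypeDetector detector = metadata -> "message/rfc822";

        Map<String, String> withHeader = new HashMap<>();
        withHeader.put(CONTENT_TYPE, "text/plain; charset=iso-8859-2");
        System.out.println(resolveContentType(withHeader, detector));

        Map<String, String> withoutHeader = new HashMap<>();
        System.out.println(resolveContentType(withoutHeader, detector));
    }
}
```

With this guard, a part whose headers declare a type keeps it, while a part with no declared type still falls back to detection. Whether skipping detection is acceptable for every caller of AutoDetectParser is exactly the open question raised at the end of the report.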
[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser
[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129 ]

Tim Allison commented on TIKA-1162:
-----------------------------------

Would you be willing to attach a document/test case that triggers this issue?

content-type/charset problem with RFC822Parser
----------------------------------------------

Key: TIKA-1162
URL: https://issues.apache.org/jira/browse/TIKA-1162