[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-09-30 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922
 ] 

Tim Allison commented on TIKA-1162:
---

Dear Colleague,
  I'm on paternity leave.  Will be back part time on October 14.

   Best,

Tim



 content-type/charset problem with RFC822Parser
 --

 Key: TIKA-1162
 URL: https://issues.apache.org/jira/browse/TIKA-1162
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Maciej Lizewski

 RFC822Parser (MIME mail) uses MailContentHandler, which internally uses
 AutoDetectParser to handle each MIME part. The problem is that
 MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and
 CONTENT_ENCODING metadata correctly, and passes this metadata to the
 AutoDetectParser::parse method. But that method ignores those values and
 overwrites the content type:
 MediaType type = this.getDetector().detect(tis, metadata);
 metadata.set(Metadata.CONTENT_TYPE, type.toString());
 This leads to additional recursion (the Detector returns
 message/rfc822 instead of the proper MIME type for the current MIME part).
 The parser eventually exits the loop, but without the proper content-type
 and content-encoding headers.
 My proposal is to add a check in AutoDetectParser::parse for whether the
 metadata already contains CONTENT_TYPE and, if so, not to override it. If
 that is not valid behavior in general, then RFC822Parser should use a custom
 parser in MailContentHandler that respects the passed content type.
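
The proposed check could look roughly like the sketch below. It is not Tika code: to stay self-contained it models the metadata as a plain Map, and the class name, the detect() stand-in, and both helper methods are hypothetical names chosen for illustration. The stand-in detector always answers message/rfc822, mimicking the misdetection described in the report.

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeGuardSketch {
    // Hypothetical key; stands in for the CONTENT_TYPE metadata name.
    static final String CONTENT_TYPE = "Content-Type";

    // Hypothetical stand-in for a detector that misreports a MIME part
    // as message/rfc822, as described in the report.
    static String detect(byte[] data) {
        return "message/rfc822";
    }

    // Behavior described in the report: the detected type always
    // replaces whatever the caller already put in the metadata.
    static void overwriteAlways(Map<String, String> metadata, byte[] data) {
        metadata.put(CONTENT_TYPE, detect(data));
    }

    // Proposed behavior: run detection only when no content type
    // has been supplied by the caller.
    static void setIfAbsent(Map<String, String> metadata, byte[] data) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared == null || declared.trim().isEmpty()) {
            metadata.put(CONTENT_TYPE, detect(data));
        }
    }

    public static void main(String[] args) {
        byte[] part = new byte[0];

        Map<String, String> current = new HashMap<>();
        current.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        overwriteAlways(current, part);
        System.out.println("overwrite always: " + current.get(CONTENT_TYPE));

        Map<String, String> proposed = new HashMap<>();
        proposed.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        setIfAbsent(proposed, part);
        System.out.println("set if absent:    " + proposed.get(CONTENT_TYPE));
    }
}
```

With the guard in place, the charset parameter declared in the part headers survives, while detection still fills in a type when the caller supplied none. Whether skipping detection is acceptable for every caller of AutoDetectParser is exactly the open question raised in the last sentence of the report.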



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

2013-08-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742129#comment-13742129
 ] 

Tim Allison commented on TIKA-1162:
---

Would you be willing to attach a document/test case that triggers this issue?

