RE: Character corruption with Traditional chinese

Oleg Kalnichevski Tue, 14 Feb 2012 04:20:28 -0800

On Tue, 2012-02-14 at 11:53 +0000, Sharma, Ashish wrote:
> Oleg,
> 
> I am using 'mime4j' as follows:
> 
>     MimeConfig mime4jParserConfig = new MimeConfig();
>     BodyDescriptorBuilder bodyDescriptorBuilder = new DefaultBodyDescriptorBuilder();
>     MimeStreamParser mime4jParser = new MimeStreamParser(mime4jParserConfig, DecodeMonitor.SILENT, bodyDescriptorBuilder);
>     mime4jParser.setContentDecoding(true);
>     mime4jParser.setContentHandler(contentHandler);
> 
>     mime4jParser.parse(rawEmailFile);
> 
>     return ((CustomContentHandler)contentHandler).getEmail();
> 
> Here, as you can see I am using the content decoding as provided by mime4j 
> for email body parts.
> 
> The contentHandler that I am using just listens for basic events and is 
> of the following type:
> 
>     public class CustomContentHandler extends AbstractContentHandler {
> 
>         public void field(Field field) throws MimeException {}
> 
>         public void body(BodyDescriptor bd, InputStream is) throws MimeException, IOException {
>             ((MaximalBodyDescriptor)bd).setCharset(getFallbackCharset(bd.getCharset()));
>         }
> 
>               ...
> 
> I modified the code in 'MaximalBodyDescriptor' to set charset in my 
> contentHandler as you hinted.
>


There is absolutely no need or good reason for modifying
MaximalBodyDescriptor. Just use a different charset when processing body
content.
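
A minimal sketch of that idea, using only the JDK so that it does not
depend on any particular mime4j version. The names CharsetFallback,
fallbackCharset and decodeBody are hypothetical, and the gb2312 to GB18030
substitution is an assumption based on the sample mail discussed in this
thread (GB18030 is backward compatible with GB2312, so correctly labelled
mail should decode the same way).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of mime4j: picks the charset used to decode
// a body part instead of trusting the declared one blindly.
class CharsetFallback {

    // Map the declared charset name to the charset actually used for decoding.
    static Charset fallbackCharset(String declared) {
        if (declared == null) {
            return StandardCharsets.US_ASCII; // assumed default when nothing is declared
        }
        String name = declared.trim().toLowerCase(Locale.ROOT);
        if (name.equals("gb2312")) {
            // Assumption: senders labelling GBK/GB18030 text as gb2312.
            return Charset.forName("GB18030");
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.US_ASCII;
        }
    }

    // Read the (already transfer-decoded) body stream and decode it to text.
    static String decodeBody(String declaredCharset, InputStream is) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), fallbackCharset(declaredCharset));
    }

    public static void main(String[] args) throws IOException {
        // Two Traditional Chinese characters, encoded as GB18030 but labelled gb2312.
        String original = "\u7e41\u9ad4";
        byte[] raw = original.getBytes(Charset.forName("GB18030"));
        String decoded = decodeBody("gb2312", new ByteArrayInputStream(raw));
        System.out.println(original.equals(decoded));
    }
}
```

In a ContentHandler, body(bd, is) could then call something like
decodeBody(bd.getCharset(), is) and keep the resulting String, leaving
the BodyDescriptor itself untouched.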

Oleg


> This arrangement solved my problem of character corruption.
> 
> But the problem is that, for the above code to work, I need to modify the 
> code in 'mime4j', which I want to avoid.
> 
> Can you suggest some workaround here?
> 
> Thanks
> Ashish
> 
> -----Original Message-----
> From: Oleg Kalnichevski [mailto:[email protected]] 
> Sent: Tuesday, February 14, 2012 2:42 AM
> To: [email protected]
> Subject: RE: Character corruption with Traditional chinese
> 
> On Mon, 2012-02-13 at 14:58 +0000, Sharma, Ashish wrote:
> > Hi,
> > 
> > Since I have no control over the email clients sending the mails, kindly 
> > suggest suitable measures that I can take on my end to mitigate the 
> > problem of character corruption.
> > 
> > I think overriding the charset during email body decoding would work for 
> > such cases. Can somebody point me to the relevant API hooks in mime4j 
> > that I could use for this, and is the idea feasible at all?
> > 
> > Thanks
> > Ashish
> > 
> 
> I am not sure I understand the problem you are having. MimeStreamParser
> passes an instance of BodyDescriptor for each body part it encounters.
> BodyDescriptor contains the charset of the body part (if specified)
> among other things. It is up to individual ContentHandler implementation
> to decide whether or not that charset is valid. ContentHandler can
> always choose to use a different charset encoding instead of the one
> specified by the BodyDescriptor.
> 
> Oleg 
> 
> > -----Original Message-----
> > From: Tze-Kei Lee [mailto:[email protected]] 
> > Sent: Monday, February 13, 2012 5:45 PM
> > To: [email protected]
> > Subject: Re: Character corruption with Traditional chinese
> > 
> > Hi,
> > 
> > It looks like the email client that composed the email made a mistake
> > when picking the charset.
> > 
> > GB 2312 contains only Simplified Chinese, while CP 936 (GBK) and
> > GB 18030 are extended to include Traditional Chinese (and Japanese,
> > Korean), and the first sentence in the email uses the extended code
> > points.
> > 
> > Best Regards
> > 
> > Tze-Kei
> > 
> > On Mon, Feb 13, 2012 at 7:32 PM, Sharma, Ashish <[email protected]> 
> > wrote:
> > > Hi,
> > >
> > > I use mime4j 0.7.2 for email parsing.
> > >
> > > I am seeing character corruption with Traditional Chinese 
> > > characters.
> > >
> > > Sample email that is creating problems is at:
> > >
> > > http://pastebin.com/Q38VXsLb
> > >
> > > Here I noticed that when the email is parsed with the default charset 
> > > encoding (the one received from the email server):
> > >
> > > charset="gb2312"
> > >
> > > I get the character set corruption, while if I manually change this 
> > > charset encoding in the email stream to :
> > >
> > > charset="gb18030"
> > >
> > > and then parse it via mime4j, there is no character corruption.
> > >
> > > Can somebody please explain why I am getting this behavior?
> > >
> > > Moreover is there a way in mime4j where I can substitute character sets 
> > > for the above kind of specific cases?
> > >
> > > Thanks
> > > Ashish
> > >
> > >
> > >
> 
>
