On Tue, 2012-02-14 at 11:53 +0000, Sharma, Ashish wrote:
> Oleg,
>
> I am using 'mime4j' as follows:
>
> MimeConfig mime4jParserConfig = new MimeConfig();
> BodyDescriptorBuilder bodyDescriptorBuilder = new DefaultBodyDescriptorBuilder();
> MimeStreamParser mime4jParser = new MimeStreamParser(
>         mime4jParserConfig, DecodeMonitor.SILENT, bodyDescriptorBuilder);
> mime4jParser.setContentDecoding(true);
> mime4jParser.setContentHandler(contentHandler);
>
> mime4jParser.parse(rawEmailFile);
>
> return ((CustomContentHandler)contentHandler).getEmail();
>
> As you can see, I am using the content decoding that mime4j provides
> for email body parts.
>
> The contentHandler that I am using just listens for basic events and is
> of the following type:
>
> public class CustomContentHandler extends AbstractContentHandler {
>
>     public void field(Field field) throws MimeException {}
>
>     public void body(BodyDescriptor bd, InputStream is)
>             throws MimeException, IOException {
>         ((MaximalBodyDescriptor) bd).setCharset(getFallbackCharset(bd.getCharset()));
>     }
>
> ...
>
> I modified the code in 'MaximalBodyDescriptor' so that I can set the charset
> from my contentHandler, as you hinted.
>
There is absolutely no need or good reason for modifying
MaximalBodyDescriptor. Just use a different charset when processing body
content.
Oleg
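
A minimal sketch of the suggestion above, assuming the handler reads the body stream itself and decodes it with a charset of its own choosing. `CharsetSubstitution`, its mapping table, and both fallbacks are hypothetical and not part of mime4j; only the gb2312 to gb18030 substitution comes from this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of mime4j: picks the charset used to decode a body. */
final class CharsetSubstitution {

    // Declared charset (lower case) -> charset to decode with instead.
    // The single entry mirrors the gb2312/gb18030 case from this thread.
    private static final Map<String, String> SUBSTITUTES = new HashMap<>();
    static {
        SUBSTITUTES.put("gb2312", "GB18030");
    }

    private CharsetSubstitution() {}

    /** Returns the charset to decode with for the given declared charset name. */
    static Charset resolve(String declared) {
        if (declared == null || declared.trim().isEmpty()) {
            // Assumption: treat a missing charset as the MIME default for text, US-ASCII.
            return StandardCharsets.US_ASCII;
        }
        String key = declared.trim().toLowerCase(Locale.ROOT);
        String name = SUBSTITUTES.getOrDefault(key, key);
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            // Assumption: ISO-8859-1 as a last resort, since it maps every byte.
            return StandardCharsets.ISO_8859_1;
        }
    }

    /** Reads the whole (already transfer-decoded) stream and decodes it as text. */
    static String decode(InputStream in, String declared) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), resolve(declared));
    }
}
```

Inside `body(BodyDescriptor bd, InputStream is)` the handler could then call `CharsetSubstitution.decode(is, bd.getCharset())` and leave `MaximalBodyDescriptor` untouched.
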
> This arrangement solved my problem of character corruption.
>
> But the problem is that for the above code to work I need to
> modify the code in 'mime4j', which I want to avoid.
>
> Can you suggest some workaround here?
>
> Thanks
> Ashish
>
> -----Original Message-----
> From: Oleg Kalnichevski [mailto:[email protected]]
> Sent: Tuesday, February 14, 2012 2:42 AM
> To: [email protected]
> Subject: RE: Character corruption with Traditional chinese
>
> On Mon, 2012-02-13 at 14:58 +0000, Sharma, Ashish wrote:
> > Hi,
> >
> > Since I have no control over the email clients sending the mails, kindly
> > suggest suitable measures that I can take on my end to mitigate the
> > problem of character corruption.
> >
> > I think modifying the charset during email body decoding will work for such
> > cases. Can somebody point out the relevant API hooks in mime4j that I can
> > use for this idea (and is it feasible)?
> >
> > Thanks
> > Ashish
> >
>
> I am not sure I understand the problem you are having. MimeStreamParser
> passes an instance of BodyDescriptor for each body part it encounters.
> BodyDescriptor contains the charset of the body part (if specified)
> among other things. It is up to the individual ContentHandler implementation
> to decide whether or not that charset is valid. A ContentHandler can
> always choose to use a different charset encoding instead of the one
> specified by the BodyDescriptor.
>
> Oleg
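
One way a handler might decide whether the declared charset is valid is to decode strictly and fall back only when the bytes do not fit. The sketch below uses only `java.nio.charset`; `StrictDecoding` and its method names are hypothetical, and the choice of fallback charset is left to the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper, not part of mime4j. */
final class StrictDecoding {

    private StrictDecoding() {}

    /** Decodes strictly: throws instead of silently substituting U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Tries the declared charset first and the fallback only if the bytes do not fit it. */
    static String decodeWithFallback(byte[] bytes, Charset declared, Charset fallback) {
        try {
            return decodeStrict(bytes, declared);
        } catch (CharacterCodingException e) {
            // Lenient on the second attempt: bad bytes become U+FFFD rather than an error.
            return new String(bytes, fallback);
        }
    }
}
```

A handler would buffer the bytes of a body part, then call `decodeWithFallback` with the charset named by the BodyDescriptor and a fallback such as GB18030.
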
>
> > -----Original Message-----
> > From: Tze-Kei Lee [mailto:[email protected]]
> > Sent: Monday, February 13, 2012 5:45 PM
> > To: [email protected]
> > Subject: Re: Character corruption with Traditional chinese
> >
> > Hi,
> >
> > It looks like the email client that composed the email made a mistake
> > when picking the charset.
> >
> > GB 2312 contains only Simplified Chinese, while CP 936 (GBK) and GB 18030
> > are extended to include Traditional Chinese (and Japanese, Korean), and
> > the first sentence in the email is using the extended code points.
> >
> > Best Regards
> >
> > Tze-Kei
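
The difference in coverage can be checked directly with the JDK's charset encoders. A small sketch, assuming the JDK in use ships both GB2312 and GB18030; `CoverageCheck` is a hypothetical name.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for checking which characters a charset can represent. */
final class CoverageCheck {

    private CoverageCheck() {}

    /** True if every character of text has an encoding in the named charset. */
    static boolean covers(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }
}
```

With this helper, `covers("GB2312", "\u9ad4")` (the Traditional character 體) is expected to be false, while the same call with "GB18030" is expected to be true.
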
> >
> > On Mon, Feb 13, 2012 at 7:32 PM, Sharma, Ashish <[email protected]>
> > wrote:
> > > Hi,
> > >
> > > I use mime4j 0.7.2 for email parsing.
> > >
> > > I am seeing character corruption with Traditional Chinese
> > > characters.
> > >
> > > A sample email that triggers the problem is at:
> > >
> > > http://pastebin.com/Q38VXsLb
> > >
> > > I noticed that when the email is parsed with the default charset
> > > encoding (the one received from the email server) of:
> > >
> > > charset="gb2312"
> > >
> > > I get the character corruption, while if I manually change this
> > > charset encoding in the email stream to:
> > >
> > > charset="gb18030"
> > >
> > > and then parse it via mime4j, there is no character corruption.
> > >
> > > Can somebody please explain why I am getting this behavior?
> > >
> > > Moreover, is there a way in mime4j to substitute character sets
> > > for specific cases like the one above?
> > >
> > > Thanks
> > > Ashish
> > >
> > >
> > >
>
>