Re: envelope, bodystructure & character set?

petite_abeille Thu, 19 Aug 2004 11:24:19 -0700

Hi Cyrus,

Current versions of Mulberry do not support 2-byte character sets, so that is to be expected. Our next release will add full support for 2-byte charsets.

Ok. Another thing that would be nice to fix is that on Mac OS X, Mulberry insists on setting itself as the default email client. Repeatedly. This is extremely annoying :)

First of all, bare 8-bit characters in headers and MIME parameters are illegal, and anyone sending such messages will have to expect them not to be readable by others.


Ok. US-ASCII rules.

Given that, making a best effort to present something sensible to the client might be possible. However, how do you know what the original character set is?

I have all the original raw messages. They all seem to be encoded properly as far as I can tell.

In your example, what was the original character set of the Chinese characters? Was it gb2312, big5, utf8?


In fact, they were Chinese characters in Japanese text (ISO-2022-JP).

Blindly encoding the 8bit as utf8 is not guaranteed to work. Other servers take several different approaches to this: some simply replace all 8bit characters with 'X' or '?' characters, others will pass the 8-bit data as-is and leave it up to the client to decide what to do. There are algorithms that attempt a best effort 'guess' as to the charset of 8bit data and you could apply those and use the result for the MIME encoding.
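As a rough illustration of the "best effort" guess mentioned above, the sketch below tries strict UTF-8 first and falls back to one legacy charset. The function name guess_decode and the ISO-8859-1 fallback are assumptions made for this example; real charset detectors use statistical heuristics and can still guess wrong.

```python
def guess_decode(raw: bytes, fallback: str = "iso-8859-1"):
    """Return (text, charset) for bare 8-bit header bytes.

    Hypothetical helper: strict UTF-8 is tried first because random
    legacy 8-bit data rarely forms valid UTF-8 sequences. The fallback
    charset is an assumption, not something the message tells us.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail,
        # but the result is only correct if the guess is right.
        return raw.decode(fallback, errors="replace"), fallback


text, charset = guess_decode(b"_Untitl\xebd.txt")
print(charset, text)
```

The charset returned here is the one a server would then name when re-encoding the value for the client.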

The other thing to note is that =?...?= etc encoding is not allowed in MIME parameters (e.g. 'name=' in 'Content-Type'). Sadly, some vendors chose to have their clients generate parameters with that encoding, and now other clients have to be prepared to handle it, otherwise users complain :-( The proper way to do parameter charset encoding is to follow RFC 2231.
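For reference, an RFC 2231 extended parameter value has the form charset'language'percent-encoded-octets. A minimal sketch of encoding and decoding a single, non-continued value follows; the helper names encode_ext_value and decode_ext_value are made up for this example, and parameter continuations (name*0*, name*1*) are not handled.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_ext_value(text: str, charset: str = "utf-8", language: str = "") -> str:
    """Build charset'language'pct-encoded for use as  name*=<value>."""
    # safe="" percent-encodes everything except letters, digits and "_.-~"
    return f"{charset}'{language}'{quote(text, safe='', encoding=charset)}"


def decode_ext_value(value: str) -> str:
    """Decode a single RFC 2231 extended value back to text."""
    charset, _language, encoded = value.split("'", 2)
    return unquote_to_bytes(encoded).decode(charset)


print(decode_ext_value("ISO-8859-1''%5FUntitl%EBd.txt"))
print(encode_ext_value("_Untitl\u00ebd.txt", charset="iso-8859-1"))
```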


Yep. My bad. I used the wrong encoding method there.

That said, I suspect that my problem runs deeper than this... consider the following MIME part:

Content-Type: application/octet-stream;
        x-unix-mode=0644;
        name="_Untitl?d.txt"
Content-Disposition: attachment;
        filename*=ISO-8859-1''%5FUntitl%EBd.txt
Content-Transfer-Encoding: 7bit

My corresponding IMAP BODYSTRUCTURE looks like this:

("application" "octet-stream" ("name" "_Untitl?d.txt" "x-unix-mode" "0644") NIL NIL "7bit" 16)

Note that the BODYSTRUCTURE reports the content of "Content-Type: name=", which is plain ASCII and doesn't have anything to say about "Content-Disposition: filename*=" where the properly encoded filename resides.

The result of this mismatch is that a client which relies on BODYSTRUCTURE always gets the "degenerated" attachment name. On the other hand, clients that simply fetch the raw MIME part and do the MIME parsing themselves seem to use the Content-Disposition's filename properly and display the correctly encoded file name...
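The mismatch can be reproduced outside of IMAP. The sketch below feeds the headers above to Python's standard email package, used here only as a stand-in for "a client that parses the raw part itself"; the body text is a placeholder, not the original attachment.

```python
import email

# Headers copied from the MIME part above; the body is a placeholder.
RAW_PART = (
    "Content-Type: application/octet-stream;\n"
    "        x-unix-mode=0644;\n"
    '        name="_Untitl?d.txt"\n'
    "Content-Disposition: attachment;\n"
    "        filename*=ISO-8859-1''%5FUntitl%EBd.txt\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "placeholder body\n"
)

part = email.message_from_string(RAW_PART)

# What a BODYSTRUCTURE built only from Content-Type parameters exposes.
ascii_name = part.get_param("name")

# get_filename() looks at Content-Disposition first and decodes RFC 2231.
decoded_name = part.get_filename()

print("Content-Type name:          ", ascii_name)
print("Content-Disposition filename:", decoded_name)
```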

Does any of this make sense?

What should I do on the server side? Substitute the encoded Content-Disposition's filename* for Content-Type's name? Ignore it?
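One option, offered as a sketch rather than as an answer from this thread: RFC 3501 allows BODYSTRUCTURE to carry extension data after the basic fields, namely body MD5, then body disposition, then language and location. The snippet below appends the disposition to the structure shown earlier. The helper names are invented for this example, and passing the raw filename* value through unchanged is an assumption; whether a given client reads or decodes it would need testing.

```python
def quoted(value):
    """Serialize a Python string as an IMAP quoted string (None -> NIL)."""
    if value is None:
        return "NIL"
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def param_list(pairs):
    """Serialize attribute/value pairs as a parenthesized list."""
    if not pairs:
        return "NIL"
    return "(" + " ".join(f"{quoted(k)} {quoted(v)}" for k, v in pairs) + ")"


def disposition(dtype, pairs):
    """Serialize the body disposition extension field."""
    return f"({quoted(dtype)} {param_list(pairs)})"


# Basic fields exactly as in the BODYSTRUCTURE above.
basic = (
    '"application" "octet-stream" '
    '("name" "_Untitl?d.txt" "x-unix-mode" "0644") NIL NIL "7bit" 16'
)

# Extension data: body MD5 (NIL here), then the disposition.
dsp = disposition("attachment", [("filename*", "ISO-8859-1''%5FUntitl%EBd.txt")])
extended = f"({basic} NIL {dsp})"
print(extended)
```

This leaves Content-Type's name untouched and gives a BODYSTRUCTURE-only client a way to see the Content-Disposition parameter.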

However, whatever you choose to support, lack of knowledge of the original charset will again be an issue.

Agreed. But usually I have access to the original raw message, so this shouldn't be an issue in this case.

Cheers,

PA.
