Re: Decode UTF-8 (and other) strings

2004-06-17 Thread Shawn Walker
On Thu, 17 Jun 2004 11:49:46 -0700 (PDT), Mark Crispin  
<[EMAIL PROTECTED]> wrote:

> On Thu, 17 Jun 2004, Shawn Walker wrote:
>> I'm trying to use utf8_mime2text() to convert a UTF-8 string to
>> text.
>> The sample string that I'm trying to convert is:
>> =?UTF-8?Q?Us=C3=A9r T=C3=A9st?=
>
> That is not a proper MIME quoted-word, and consequently
> utf8_mime2text() declines to deal with it.
>
>> What is the correct method of decoding the strings?
>
> That string is properly MIME decoded as the text
> =?UTF-8?Q?Us=C3=A9r T=C3=A9st?=
> A proper MIME quoted-word, such as
> =?UTF-8?Q?Us=C3=A9r_T=C3=A9st?=
> would be decoded into a string with 8-bit characters.
> The syntax for MIME quoted words was very carefully selected so that
> there would be no mistaken decodings.  It is necessary to consider the
> effects of RFC 2822 line wrapping with spaces.  Consequently, spaces are
> forbidden within quoted words.
>
> I understand the argument of "why aren't you more forgiving in an
> obvious case such as this?".  The answer is that you should take a look
> at Outlook.  Many of the exploits in Outlook involve attacking Outlook's
> willingness to "just work" in "obvious cases."  Without strict rules,
> software is deprived of the clear guidelines that would enable it to
> reject absurd cases.  To make matters worse, the code paths that forgive
> such "obvious forgivable cases" are rarely exercised; most data complies
> with the rules.  This creates a fertile breeding ground for bugs.

I finally figured out that the quoted string is invalid.  If it's invalid, I  
say "tough" and blame it on whoever formed the invalid quoted string.




Re: Decode UTF-8 (and other) strings

2004-06-17 Thread Mark Crispin
On Thu, 17 Jun 2004, Shawn Walker wrote:
> I'm trying to use utf8_mime2text() to convert a UTF-8 string to text.
> The sample string that I'm trying to convert is:
> =?UTF-8?Q?Us=C3=A9r T=C3=A9st?=
That is not a proper MIME quoted-word, and consequently utf8_mime2text() 
declines to deal with it.

> What is the correct method of decoding the strings?
That string is properly MIME decoded as the text
=?UTF-8?Q?Us=C3=A9r T=C3=A9st?=
A proper MIME quoted-word, such as
=?UTF-8?Q?Us=C3=A9r_T=C3=A9st?=
would be decoded into a string with 8-bit characters.
The syntax for MIME quoted words was very carefully selected so that there 
would be no mistaken decodings.  It is necessary to consider the effects 
of RFC 2822 line wrapping with spaces.  Consequently, spaces are forbidden 
within quoted words.
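
For illustration, the strictness described above can be sketched in Python. This is not the c-client code behind utf8_mime2text(); the function name decode_q_word and the regular expression are hypothetical, and the sketch handles only a single Q-encoded word (what RFC 2047 calls an "encoded-word"), ignoring B encoding and the length limit. Anything that does not match the strict syntax, including a word containing a space, is returned unchanged.

```python
import re

# Strict shape of a Q encoded-word: =?charset?Q?encoded-text?=
# The encoded text may hold only printable ASCII other than "?" and,
# crucially, no space. (Hypothetical pattern written for this sketch.)
_Q_WORD = re.compile(r"=\?([^?\s]+)\?[Qq]\?([!->@-~]*)\?=")
_HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")


def decode_q_word(word):
    """Decode one strict Q encoded-word; return the input unchanged otherwise."""
    match = _Q_WORD.fullmatch(word)
    if match is None:
        return word  # not a proper encoded-word, so leave it alone
    charset, payload = match.group(1), match.group(2)

    raw = bytearray()
    i = 0
    while i < len(payload):
        ch = payload[i]
        if ch == "_":
            raw.append(0x20)  # "_" stands for a space
            i += 1
        elif ch == "=":
            pair = payload[i + 1:i + 3]
            if not _HEX_PAIR.fullmatch(pair):
                return word  # malformed =XX escape
            raw.append(int(pair, 16))
            i += 3
        else:
            raw.append(ord(ch))
            i += 1

    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return word  # unknown charset or invalid byte sequence


print(decode_q_word("=?UTF-8?Q?Us=C3=A9r_T=C3=A9st?="))  # Usér Tést
print(decode_q_word("=?UTF-8?Q?Us=C3=A9r T=C3=A9st?="))  # printed unchanged
```

The second call shows the point of the rule: with a literal space in it the string is not an encoded-word at all, so it passes through as plain text instead of being guessed at.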

I understand the argument of "why aren't you more forgiving in an obvious 
case such as this?".  The answer is that you should take a look at 
Outlook.  Many of the exploits in Outlook involve attacking Outlook's 
willingness to "just work" in "obvious cases."  Without strict rules, 
software is deprived of the clear guidelines that would enable it to 
reject absurd cases.  To make matters worse, the code paths that forgive 
such "obvious forgivable cases" are rarely exercised; most data complies 
with the rules.  This creates a fertile breeding ground for bugs.

-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.