[issue19063] Python 3.3.3 encodes emails containing non-ascii data as 7bit

R. David Murray Wed, 20 Nov 2013 11:21:22 -0800

R. David Murray added the comment:

Vajrasky: thanks for taking a crack at this, but, well, there are a lot of 
subtleties involved here, because the organic growth of the email package 
over many years has led to some really bad design issues.


It took me a lot of time to boot back up my understanding of how all this stuff 
hangs together (answer: badly).  After wandering down many blind alleys, the 
problem turns out to be yet one more disconnect in the model.  We previously 
fixed the issue where if set_payload was passed binary data bad things would 
happen.  That made the model more consistent, in that _payload was now a 
surrogateescaped string when the payload was specified as binary data.
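
For readers who have not met it, here is a minimal sketch of what "a 
surrogateescaped string" means, using only the standard codecs.  It does not 
touch the email package; the variable names are illustrative only.

```python
# Bytes that are not ASCII, as a binary payload might contain.
raw = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'

# Decoding as ASCII with surrogateescape maps each non-ASCII byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
as_text = raw.decode("ascii", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
restored = as_text.encode("ascii", "surrogateescape")
```

The point is that the str form can carry arbitrary bytes losslessly, which is 
what lets a str-typed _payload stand in for binary data.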

But what the model *really* needs is that _payload *always* be an 
ascii+surrogateescape string, and never a full unicode string.  (Yeah, this is 
a sucky model...it ought to always be binary instead, but we are dealing with 
legacy code here.)

Currently it can be a unicode string.  If it is, set_charset turns it into an 
ascii only string by encoding it with the qp or base64 CTE.  This is pretty 
much just by luck, though.

If you set body_encoding to None, the encode_7or8bit encoder thinks the string 
is 7bit.  It calls get_payload(decode=True) which, because the model invariant 
was broken, returns the payload encoded with raw-unicode-escape, which is a 
7bit representation.  That doesn't affect the payload, but it does result in 
the wrong CTE being used.
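
The misdetection can be illustrated at the codec level.  This is only a sketch: 
looks_7bit is a hypothetical helper standing in for the kind of ASCII check an 
encoder might do, not the email package's actual code.  The sample text is 
Cyrillic because raw-unicode-escape emits characters below U+0100 as single 
bytes and only escapes the rest.

```python
def looks_7bit(data: bytes) -> bool:
    """Hypothetical check: True if every byte is plain ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

text = "\u043f\u0440\u0438\u0432\u0435\u0442"   # a Cyrillic word

# raw-unicode-escape spells these characters as \uXXXX escapes: all ASCII.
escaped = text.encode("raw-unicode-escape")

# The bytes that would really go on the wire in utf-8 are not ASCII.
utf8 = text.encode("utf-8")

escaped_is_7bit = looks_7bit(escaped)
utf8_is_7bit = looks_7bit(utf8)
```

So a check made against the escaped form says "7bit" even though the utf-8 
encoding of the same text needs 8bit.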

The fix is to fix the model invariant by turning a unicode string passed in to 
set_payload into an ascii+surrogateescape string with the escaped bytes being 
the unicode encoded to the output charset.
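
A rough sketch of that transformation, again at the codec level.  The helper 
name to_payload_str is made up for illustration and is not part of the patch 
or of the email package.

```python
def to_payload_str(text: str, output_charset: str) -> str:
    """Hypothetical helper: encode to the output charset, then wrap the
    resulting bytes in an ascii+surrogateescape string."""
    encoded = text.encode(output_charset)
    return encoded.decode("ascii", "surrogateescape")

payload = to_payload_str("na\u00efve", "utf-8")

# ASCII characters pass through unchanged; the non-ASCII character is now
# represented by escaped bytes rather than by a real unicode character.
# Serializing recovers exactly the bytes in the output charset.
wire = payload.encode("ascii", "surrogateescape")
```

With the payload in this form, an ASCII check on the recovered bytes sees the 
real 8bit data and can pick the CTE accordingly.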

Unfortunately it is also possible to call set_payload without a charset, and 
*then* call set_charset.  To keep from breaking the code of anyone currently 
doing that, I had to allow a full unicode _payload, and detect it in 
set_charset.

My plan is to fix that in 3.4, causing a backward compatibility break because 
it will no longer be possible to call set_payload with a unicode string 
containing non-ascii if you don't also provide a character set.  I believe this 
is an acceptable break, since otherwise you *must* leave the model in an 
ambiguous state, and you have the possibility of "leaking" unicode characters out 
into your wire-format message, which would ultimately result in either an 
exception at serialization time or, worse, mojibake.

Patch attached.

----------
stage:  -> patch review
type:  -> behavior
versions:  -Python 3.2
Added file: http://bugs.python.org/file32730/support_8bit_charset_cte.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue19063>
_______________________________________