Come to think of it, here is another oddity I just recalled. This may have been reported already, but header decoding returns mixed types depending on the structure of the header. Converting to a str for display isn't too difficult to handle, but this seems a bit inconsistent and contrary to Python's type neutrality:
>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode(), using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

I can work around this with the following code, but it feels a bit too
tightly coupled to the package's internal details (further evidence that
email.* can be made to work as is today, even if it may be seen as less
than ideal aesthetically):

parts = email.header.decode_header(rawheader)
decoded = []
for (part, enc) in parts:                      # for all substrings
    if enc is None:                            # part unencoded?
        if not isinstance(part, bytes):        # str: full hdr unencoded
            decoded += [part]                  # else do unicode decode
        else:
            decoded += [part.decode('raw-unicode-escape')]
    else:
        decoded += [part.decode(enc)]
return ' '.join(decoded)

Thanks,
--Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)

> -----Original Message-----
> From: l...@rmi.net
> To: "R. David Murray" <rdmur...@bitdance.com>
> Subject: Re: email package status in 3.X
> Date: Sat, 12 Jun 2010 16:52:32 -0000
>
> Hi David,
>
> All sounds good, and thanks again for all your work on this.
>
> I appreciate the difficulties of moving this package to 3.X
> in a backward-compatible way. My suggestions stem from the fact
> that it does work as is today, albeit in a less than ideal way.
>
> That, and I'm seeing that Python 3.X in general is still having
> a great deal of trouble gaining traction in the "real world"
> almost 2 years after its release, and I'd hate to see further
> disincentives for people to migrate.
> This is a bigger issue
> than both the email package and this thread, of course.
>
> > > 3) Type-dependent text part encoding
> > >
> > ...
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes. I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
>
> Decoding can probably be safely delegated to package clients.
> Typical email clients will probably have str for display of the
> main text. They may wish to read attachments in binary mode, but
> can always read in text mode instead or decode manually, because
> they need a known encoding to send the part correctly (my client
> has to ask or use configurations in some cases).
>
> B/W compatibility probably isn't a concern; I suspect that my
> temporary workaround will still work with your patch anyhow,
> and this code didn't work at all for some encodings before.
>
> > > There are some additional cases that now require decoding per mail
> > > headers today due to the str/bytes split, but these are just a
> > > normal artifact of supporting Unicode character sets in general,
> > > and seem like issues for package clients to resolve (e.g., the bytes
> > > returned for decoded payloads in 3.X didn't play well with existing
> > > str-based text processing code written for 2.X).
> >
> > I'm not following you here. Can you give me some more specific
> > examples? Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
>
> This was just a general statement about things in my own code that
> didn't jibe with the 3.X string model.
> For instance, line wrapping
> logic assumed str; tkinter text widgets do much better rendering str
> than the bytes fetched for decoded payloads; and my Pyedit text editor
> component had to be overhauled to handle display/edit/save of payloads
> of arbitrary encodings. If I remember any more specific issues with
> the email package itself, I'll forward your way.
>
> I'll watch for an opportunity to get the book's new PyMailGUI
> client code to you as a candidate test case, but please ping
> me about it later if I haven't acted on this. It works well,
> but largely because of all the work that went into the email
> package underlying it.
>
> Thanks,
> --Mark Lutz (http://learning-python.com, http://rmi.net/~lutz)
>
>
> > -----Original Message-----
> > From: "R. David Murray" <rdmur...@bitdance.com>
> > To: l...@rmi.net
> > Subject: Re: email package status in 3.X
> > Date: Thu, 10 Jun 2010 10:18:48 -0400
> >
> > On Thu, 10 Jun 2010 09:21:52 -0400, l...@rmi.net wrote:
> > > In other words, some of my concern may have been a bit premature.
> > > I hope that in the future we'll either strive for compatibility
> > > or keep the current version around; it's a lot of very useful code.
> >
> > The plan is to have a compatibility layer that will accept calls based
> > on the old API and forward appropriately to the new API. So far I'm
> > thinking I can succeed in doing this in a fairly straightforward manner,
> > but I won't know for sure until I get some more pieces in place.
> >
> > > In fact, I recommend that any new email package be named distinctly,
> >
> > I'm going to avoid that if I can (though the PyPI package will be
> > named email6 when we publish it for public testing). If, however,
> > it turns out that I can't correctly support both the old and the
> > new API, then I'll have to do that.
> >
> > > and that the current package be retained for a number of releases to
> > > come.
> > > After all the breakages that 3.X introduced in general, doing
> > > the same to any email-based code seems a bit too much, especially
> > > given that the current package is largely functional as is. To me,
> > > after having just used it extensively, fixing its few issues seems
> > > a better approach than starting from scratch.
> >
> > Well, the thing is, as you found, existing 2.x code needs to be fixed to
> > correctly handle the distinction between strings and bytes no matter what.
> > The goal is to make it easier to write correct programs, while providing
> > the compatibility layer to make porting smoother. But I doubt that any
> > non-trivial 2.x email program will port without significant changes,
> > even if the compatibility layer is close to 100% compatible with the
> > current Python3 email package, simply because the previous conflation
> > of text and bytes must be untangled in order to work correctly in
> > Python3, and email involves lots of transitions between text and bytes.
> >
> > As for "starting from scratch", it is true that the current plan involves
> > considerable changes in the recommended API (in the direction of greater
> > flexibility and power), but I'm hoping that significant portions of the
> > code will carry forward with minor changes, and that this will make it
> > easier to support the old API.
> >
> > > As far as other issues, the things I found are described below my
> > > signature. I don't know what the utf-8 issue is that you refer
> > > to; I'm able to parse and send with this encoding as is without
> > > problems (both payloads and headers), but I'm probably not using the
> > > interfaces you fixed, and this may be the same as one of the items
> > > listed.
> >
> > It is, see below.
> >
> > > Another thought: it might be useful to use the book's email client
> > > as a sort of test case for the package; it's much more rigorous in
> > > the new edition because it now has to be given 3.X's Unicode model
> > > (it's about 4,900 lines of code, though not all is email-related).
> > > I'd be happy to donate the code as soon as I find out what the
> > > copyright will be this time around; it will be at O'Reilly's site
> > > this Fall in any event.
> >
> > That would be great. I am planning to write my own sample app to
> > demonstrate the new API, but if I can use yours to test the compatibility
> > layer that will help a lot, since I otherwise have no Python3 email
> > application to test against unless I port something from Python2.
> >
> > > Major issues I found...
> > > ------------------------------------------------------------------
> > > 1) Str required for parsing, but bytes returned from poplib
> > >
> > > The initial decode from bytes to str of full mail text; in
> > > retrospect, probably not a major issue, since original email
> > > standards called for ASCII. An 8-bit encoding like Latin-1 is
> > > probably sufficient for most conforming mails. For the book,
> > > I try a set of different encodings, beginning with an optional
> > > configuration module setting, then ascii, latin-1, and utf-8;
> > > this is probably overkill, but a GUI has to be defensive.
> >
> > This works (mostly) for conforming email, but some important Python email
> > applications need to deal with non-conforming email. That's where the
> > inability to parse bytes directly really causes problems.
> >
> > > 2) Binary attachments encoding
> > >
> > > The binary attachments byte-to-str issue that you've just
> > > fixed. As I mentioned, I worked around this by passing in a
> > > custom encoder that calls the original and runs an extra decode
> > > step.
> > > Here's what my fix looked like in the book; your patch
> > > may do better, and I will minimally add a note about the 3.1.3
> > > and 3.2 fix for this:
> >
> > Yeah, our patch was a lot simpler since we could fix the encoding inside
> > the loop producing the encoded lines :)
> >
> > > 3) Type-dependent text part encoding
> > >
> > > There's a str/bytes confusion issue related to Unicode encodings
> > > in text payload generation: some encodings require the payload to
> > > be str, but others expect bytes. Unfortunately, this means that
> > > clients need to know how the package will react to the encoding
> > > that is used, and special-case based upon that.
> >
> > This was the UTF-8 bug I fixed. I shouldn't have called it "the UTF-8
> > bug", because it applies equally to the other charsets that use base64,
> > as you note. I called it that because UTF-8 was where the problem was
> > noticed and is mentioned in the title of the bug report.
> >
> > I had a suspicion that the quoted-printable encoding wasn't being done
> > correctly either, so to hear that it is working for you is good news.
> > There may still be bugs to find there, though.
> >
> > So, in the next releases of Python all MIMEText input should be string,
> > and it will fail if you pass bytes. I consider this as email previously
> > not living up to its published API, but do you think I should hack
> > in a way for it to accept bytes too, for backward compatibility in the
> > 3 line?
> >
> > > There are some additional cases that now require decoding per mail
> > > headers today due to the str/bytes split, but these are just a
> > > normal artifact of supporting Unicode character sets in general,
> > > and seem like issues for package clients to resolve (e.g., the bytes
> > > returned for decoded payloads in 3.X didn't play well with existing
> > > str-based text processing code written for 2.X).
> >
> > I'm not following you here. Can you give me some more specific
> > examples?
> > Even if these "normal artifacts" must remain with
> > the current API, I'd like to make things as easy as practical when
> > using the new API.
> >
> > Thanks for all your feedback!
> >
> > --David
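
Addendum: the header-decoding workaround at the top of this message can be sketched as a small standalone function. This is only an illustration under stated assumptions. The name decode_header_text and its fallback parameter are hypothetical and not part of the email package. The final whitespace normalization is also my own assumption, added because Python 3 releases appear to differ in whether decode_header keeps the spaces next to unencoded parts.

```python
from email.header import decode_header


def decode_header_text(rawheader, fallback='raw-unicode-escape'):
    """Return a header value as str for display.

    Hypothetical helper (not an email package API). It copes with the
    mixed str/bytes parts that decode_header may return.
    """
    decoded = []
    for part, enc in decode_header(rawheader):
        if isinstance(part, str):
            # Whole header was unencoded: already text, nothing to decode.
            decoded.append(part)
        elif enc is None:
            # Unencoded bytes fragment: assume the package's own fallback codec.
            decoded.append(part.decode(fallback))
        else:
            # Encoded word: decode with the charset named in the header.
            decoded.append(part.decode(enc))
    # Assumption: collapse runs of whitespace so the result does not depend
    # on whether the parts carry their own leading or trailing spaces.
    return ' '.join(' '.join(decoded).split())
```

Called on the three sample headers S1, S2, and S3 above, this should give the same display string in each case. In later Python 3 releases, str(email.header.make_header(decode_header(rawheader))) may be a simpler route to the same result, but I have not checked how it behaves on 3.1.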