On approximately 10/3/2009 10:09 AM, came the following characters from the keyboard of Timothy Farrell:
I agree with Barry insofar as accepting bytes or strings on input, processing 
internally in bytes, and outputting bytes or strings depending on the content 
parsed.

Forgive my ignorance... why does converting bytes to strings have to be a mess?  
Rather than having two Feedparsers, can't we just pass a default encoding when 
instantiating a feedparser and have it read from the MIME headers otherwise?  
If no encoding is passed and one can't be determined, simply output as bytes, 
or try a default and raise an exception if it fails.

If providing the default encoding, no such range check is needed.

----- Original Message -----
From: "Stephen J. Turnbull" <[email protected]>
To: "Barry Warsaw" <[email protected]>
Cc: "Timothy Farrell" <[email protected]>, [email protected]
Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
Subject: Re: [Email-SIG] fixing the current email module

Barry Warsaw writes:

 > So the basic model is: accept strings or bytes at the edges,
 > process everything internally as bytes, output strings and bytes at
 > the edges.

In a certain pedantic sense, that can't be right, because bytes alone
can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to
be interpreted as a string, and that is going to be one big mess.
(MIME?)

Going the other way around, you have no such problem, or rather the
trivial embedding works fine, except that you have to do a range check
at some point before you convert to bytes.

Email messages are bytes: usually restricted to bytes in the range 32-127, but sometimes permitted to be 0-255 (8bit encoding).

Email messages carry sufficient information to convert bytes to strings (usually; and sufficient defaults to cover the other cases adequately, even if not with 100% certainty).

So if Barry is considering that the internal form is bytes, particularly bytes encoded via the email RFCs, then I can't argue with that being a reasonable internal form... except for one problem, described two paragraphs below.

The only mess that I can see Stephen referring to is the fact that the email RFCs define rather messy encoding formats and character set specifications. There isn't much cure for this, AFAICS, other than perhaps keeping the bytes in segmented structures, with cached metadata to speed repeated references. Using any format other than email format means knowing how to translate it to/from email format, and to/from API format; this means coding two translation routines instead of one.
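To make the segmented-structure idea concrete, here is a minimal sketch (the class and method names are hypothetical, nothing in the email module) of keeping wire-format bytes per part, with a cache so repeated header references need not re-scan the raw bytes:

    from dataclasses import dataclass, field

    @dataclass
    class BytesSegment:
        raw: bytes                          # wire-format bytes for one part
        _headers: dict = field(default_factory=dict, repr=False)

        def header(self, name):
            # Decode and cache a header value on first access. Folded
            # headers and comments are ignored; illustration, not a parser.
            key = name.lower()
            if key not in self._headers:
                value = None
                for line in self.raw.split(b'\r\n'):
                    if not line:            # blank line ends the headers
                        break
                    fname, sep, rest = line.partition(b':')
                    if sep and fname.strip().lower() == key.encode('ascii'):
                        value = rest.strip().decode('ascii', 'replace')
                        break
                self._headers[key] = value
            return self._headers[key]

    seg = BytesSegment(b'Subject: test\r\n\r\nbody')
    print(seg.header('Subject'))            # -> test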

The choice of email RFC byte formats for the internal form makes it quick and easy to produce a complete message when called for, and to defer interpretation when a message is fed in.... sometimes, and herein lies the catch....

One problem with storing messages in bytes format: it seems to me that the choice among several legal email byte formats to represent the various email parts (texts and attachments) is problematic for using email-format bytes as the internal storage format. An unsophisticated email library could assume that the transfer encoding is always 7bit, and that should be acceptable in all circumstances. A more sophisticated email library would provide support for either 7bit or 8bit transfer encodings... but the choice of byte formats, and MIME type encodings of the various message parts, to support that difference would be significant. It seems that the present email lib provides a way to create only a 7bit or 8bit message (and apparently not binary encoding), meaning that the whole message assembly process has to be done after initiating a connection with the SMTP server, in order to determine whether it supports 8bit (or binary) encoding or not. A more abstract internal format could defer that choice to the generate step, keeping items as str or binary blobs prior to that step.
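For reference, the server capability probe itself is cheap; a sketch using smtplib (the host name is a placeholder), illustrating why assembly currently has to wait for the connection:

    import smtplib

    with smtplib.SMTP('mail.example.com') as smtp:  # placeholder host
        smtp.ehlo()
        if smtp.has_extn('8bitmime'):
            cte = '8bit'    # raw 8-bit bodies are acceptable
        else:
            cte = '7bit'    # must fall back to quopri or base64
        print('Content-Transfer-Encoding to generate:', cte)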

IIUC, 7bit requires that text and binary data be encoded to remove "difficult" byte values from the byte stream, so MIME part definition time is an appropriate point to choose between quopri and base64 (although an optimally sized choice could be made based on the data), in the event that generate requests 7bit.
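A sketch of the size-based choice just mentioned (a heuristic of my own for illustration, ignoring quoted-printable's '=' escaping and soft line breaks):

    def pick_7bit_encoding(payload):
        # bytes needing quoting: non-ASCII, or controls other than TAB/LF/CR
        unsafe = sum(b > 126 or (b < 32 and b not in (9, 10, 13))
                     for b in payload)
        # base64 always adds ~33%; quoted-printable adds ~2 octets per
        # quoted byte, so it wins while 2*unsafe < len/3
        if 2 * unsafe < len(payload) // 3:
            return 'quoted-printable'
        return 'base64'

    print(pick_7bit_encoding(b'mostly plain ASCII text'))  # quoted-printable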

8bit, however, has no such requirement: it declares that there are no difficult characters except NUL, CR, and LF. But because no 8bit content encodings are defined, the (inefficient, 7-bit) quopri or base64 may still have to be used to avoid lines that are too long, and to encode NUL, CR, and LF. 8-bit and UTF-8 text containing no NUL characters and no long lines would qualify without encoding.
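The qualification test is mechanical; a sketch, using RFC 2822's 998-octet line limit (my reading) as the threshold:

    def qualifies_for_8bit(payload):
        if b'\x00' in payload:
            return False                  # NUL never allowed
        for line in payload.split(b'\r\n'):
            # bare CR or LF outside a CRLF pair, or an overlong line,
            # disqualifies the body from raw 8bit transport
            if b'\r' in line or b'\n' in line or len(line) > 998:
                return False
        return True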

Finally, binary declares that there are no difficult characters at all. Therefore, the quopri or base64 choice could be ignored, and the raw data passed through.

Choosing a particular Content-Transfer-Encoding as the internal storage format forces transcoding to the other Content-Transfer-Encoding values on the fly after connecting to the SMTP server (via an apparently non-existent parameter to the generate method); not supporting on-the-fly transcoding would force the user to choose a particular Content-Transfer-Encoding up front, requiring the connection to the SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary transport, but it seems that MS Exchange does.

I observe that binary transport is more efficient than 7bit or 8bit.

I observe that even with binary transport, the MIME headers must still be in US-ASCII, by definition, so the headers need not be generated differently for different transports... only the Content-Transfer-Encoding, and the content itself, would be affected by deferring that choice to generate time.

Perhaps binary transport, with metadata indicating whether the user prefers quopri or base64 for parts that must be encoded for 7bit or 8bit transport, would be an appropriate storage format for the email library. This would allow the quopri or base64 encodings to be performed on the fly, only if needed, by adding a new parameter to generate that specifies the Content-Transfer-Encoding (which should default to 7bit for maximal server compatibility, or to 8bit if the user specified that along the way, so that backwards compatibility is preserved).
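A sketch of what that generate-time dispatch could look like (hypothetical names, not the existing Generator API):

    import base64, quopri

    def encode_part(payload, cte, prefer='base64'):
        # Return (content-transfer-encoding, encoded-body) for one part.
        if cte == 'binary':
            return 'binary', payload      # raw pass-through
        # simplified 8bit qualification test: no NUL, no overlong lines
        clean = b'\x00' not in payload and all(
            len(line) <= 998 for line in payload.split(b'\r\n'))
        if cte == '8bit' and clean:
            return '8bit', payload
        if prefer == 'quoted-printable':
            return 'quoted-printable', quopri.encodestring(payload)
        return 'base64', base64.encodebytes(payload)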


N.B. the documentation for the 2.6.3 section 19.1.3 MIMEText class (reproduced below) is confusing:

class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

   Module: email.mime.text

   A subclass of MIMENonMultipart, the MIMEText class is used to create
   MIME objects of major type text. _text is the string for the payload.
   _subtype is the minor type and defaults to plain. _charset is the
   character set of the text and is passed as a parameter to the
   MIMENonMultipart constructor; it defaults to us-ascii. No guessing or
   encoding is performed on the text data.

   Changed in version 2.4: The previously deprecated _encoding argument
   has been removed. Encoding happens implicitly based on the _charset
   argument.


The confusion is that it first states that no encoding is performed, and then states that encoding happens implicitly. It is not clear what it actually does, if anything. The 3.2a0 documentation further muddies the waters by removing the last paragraph.
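For what it's worth, the implicit behavior can be observed directly (output shown is from a current Python 3 interpreter):

    from email.mime.text import MIMEText

    ascii_msg = MIMEText('hello', 'plain', 'us-ascii')
    utf8_msg = MIMEText('héllo', 'plain', 'utf-8')
    # the _charset argument drives an implicit transfer-encoding choice
    print(ascii_msg['Content-Transfer-Encoding'])  # 7bit
    print(utf8_msg['Content-Transfer-Encoding'])   # base64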


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
