On approximately 10/3/2009 10:09 AM, came the following characters from the keyboard of Timothy Farrell:
I agree with Barry insofar as accepting bytes or strings on input, processing 
internally in bytes, and outputting bytes or strings depending on the content 
parsed.

Forgive my ignorance... why does converting bytes to strings have to be a mess?  
Rather than having two Feedparsers, can't we just pass a default encoding when 
instantiating a feedparser and have it read from the MIME headers otherwise?  
If no encoding is passed and one can't be determined, simply output as bytes, 
or try a default and raise an exception if it fails.

If providing the default encoding, no such range check is needed.

----- Original Message -----
From: "Stephen J. Turnbull" <[email protected]>
To: "Barry Warsaw" <[email protected]>
Cc: "Timothy Farrell" <[email protected]>, [email protected]
Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
Subject: Re: [Email-SIG] fixing the current email module

Barry Warsaw writes:

 > So the basic model is: accept strings or bytes at the edges,
 > process everything internally as bytes, output strings and bytes at
 > the edges.

In a certain pedantic sense, that can't be right, because bytes alone
can't represent strings.

Practically, you are going to need to say how a bytes or bytearray is to
be interpreted as a string, and that is going to be one big mess.
(MIME?)

Going the other way around, you have no such problem, or rather the
trivial embedding works fine, except that you have to do a range check
at some point before you convert to bytes.

Email messages are bytes: usually restricted to bytes in the range 32-127, but sometimes permitted to be 0-255 (8bit encoding).

Email messages carry sufficient information to convert bytes to strings (usually; and sufficient defaults to cover the other cases adequately, even if not with 100% certainty).

So if Barry is considering that the internal form is bytes, particularly bytes encoded via the email RFCs, then I can't argue with that being a reasonable internal form... except for one problem, described two paragraphs below.

The only mess that I can see Stephen referring to is the fact that the email RFCs define rather messy encoding formats and character set specifications. There isn't much cure for this, AFAICS, other than perhaps keeping the bytes in segmented structures, with cached metadata to speed repeated references. Using any format other than email format means knowing how to translate it to/from email format, and to/from API format; this means coding two translation routines instead of one.
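To make the segmented-structure idea concrete, here is a minimal sketch (the class and method names are hypothetical, nothing in the email module) of keeping wire-format bytes per part, with a cache so repeated header references need not re-scan the raw bytes:

    from dataclasses import dataclass, field

    @dataclass
    class BytesSegment:
        raw: bytes                          # wire-format bytes for one part
        _headers: dict = field(default_factory=dict, repr=False)

        def header(self, name):
            # Decode and cache a header value on first access. Folded
            # headers and comments are ignored; illustration, not a parser.
            key = name.lower()
            if key not in self._headers:
                value = None
                for line in self.raw.split(b'\r\n'):
                    if not line:            # blank line ends the headers
                        break
                    fname, sep, rest = line.partition(b':')
                    if sep and fname.strip().lower() == key.encode('ascii'):
                        value = rest.strip().decode('ascii', 'replace')
                        break
                self._headers[key] = value
            return self._headers[key]

    seg = BytesSegment(b'Subject: test\r\n\r\nbody')
    print(seg.header('Subject'))            # -> test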

The choice of email RFC byte formats for the internal form makes it quick and easy to produce a complete message when called for, and to defer interpretation when a message is fed in.... sometimes, and herein lies the catch....

One problem with storing messages in bytes format: it seems to me that the choice among several legal email byte formats to represent the various email parts (texts and attachments) is problematic for using email-format bytes as the internal storage format. An unsophisticated email library could assume that the transfer encoding is always 7bit, and that should be acceptable in all circumstances. A more sophisticated email library would provide support for either 7bit or 8bit transfer encodings... but the choice of byte formats, and MIME type encodings of the various message parts, to support that difference would be significant. It seems that the present email lib provides a way to create only a 7bit or 8bit message (and apparently not binary encoding), meaning that the whole message assembly process has to be done after initiating a connection with the SMTP server, in order to determine whether it supports 8bit (or binary) encoding or not. A more abstract internal format could defer that choice to the generate step, keeping items as str or binary blobs prior to that step.
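For reference, the server capability probe itself is cheap; a sketch using smtplib (the host name is a placeholder), illustrating why assembly currently has to wait for the connection:

    import smtplib

    with smtplib.SMTP('mail.example.com') as smtp:  # placeholder host
        smtp.ehlo()
        if smtp.has_extn('8bitmime'):
            cte = '8bit'    # raw 8-bit bodies are acceptable
        else:
            cte = '7bit'    # must fall back to quopri or base64
        print('Content-Transfer-Encoding to generate:', cte)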

IIUC, 7bit requires that text and binary data be encoded to remove "difficult" byte values from the byte stream, so MIME part definition time is an appropriate point to choose between quopri and base64 (although an optimally sized choice could be made based on the data), in the event that generate requests 7bit.
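A sketch of the size-based choice just mentioned (a heuristic of my own for illustration, ignoring quoted-printable's '=' escaping and soft line breaks):

    def pick_7bit_encoding(payload):
        # bytes needing quoting: non-ASCII, or controls other than TAB/LF/CR
        unsafe = sum(b > 126 or (b < 32 and b not in (9, 10, 13))
                     for b in payload)
        # base64 always adds ~33%; quoted-printable adds ~2 octets per
        # quoted byte, so it wins while 2*unsafe < len/3
        if 2 * unsafe < len(payload) // 3:
            return 'quoted-printable'
        return 'base64'

    print(pick_7bit_encoding(b'mostly plain ASCII text'))  # quoted-printable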

8bit, however, has no such requirement: it declares that there are no difficult characters except NUL, CR, and LF. But because no 8bit content encodings are defined, the (inefficient, 7-bit) quopri or base64 may still have to be used to avoid lines that are too long, and to encode NUL, CR, and LF. 8-bit and UTF-8 text containing no NUL characters and no long lines would qualify without encoding.
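The qualification test is mechanical; a sketch, using RFC 2822's 998-octet line limit (my reading) as the threshold:

    def qualifies_for_8bit(payload):
        if b'\x00' in payload:
            return False                  # NUL never allowed
        for line in payload.split(b'\r\n'):
            # bare CR or LF outside a CRLF pair, or an overlong line,
            # disqualifies the body from raw 8bit transport
            if b'\r' in line or b'\n' in line or len(line) > 998:
                return False
        return True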

Finally, binary declares that there are no difficult characters at all. Therefore, the quopri or base64 choice could be ignored, and the raw data passed through.

Choosing a particular Content-Transfer-Encoding as the internal storage format forces transcoding to the other Content-Transfer-Encoding values on the fly after connecting to the SMTP server (via an apparently non-existent parameter to the generate method); not supporting on-the-fly transcoding would force the user to choose a particular Content-Transfer-Encoding up front, requiring the connection to the SMTP server even earlier in the process.

I observe that most of my SMTP providers do not support binary transport, but it seems that MS Exchange does.

I observe that binary transport is more efficient than 7bit or 8bit.

I observe that even with binary transport, the MIME headers must still be in US-ASCII, by definition, so the headers need not be generated differently for different transports... only the Content-Transfer-Encoding, and the content itself, would be affected by deferring that choice to generate time.

Perhaps binary transport, with metadata indicating whether the user prefers quopri or base64 for parts that must be encoded for 7bit or 8bit transport, would be an appropriate storage format for the email library. This would allow the quopri or base64 encodings to be performed on the fly, only if needed, by adding a new parameter to generate that specifies the Content-Transfer-Encoding (which should default to 7bit for maximal server compatibility, or to 8bit if the user specified that along the way, so that backwards compatibility is preserved).
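A sketch of what that generate-time dispatch could look like (hypothetical names, not the existing Generator API):

    import base64, quopri

    def encode_part(payload, cte, prefer='base64'):
        # Return (content-transfer-encoding, encoded-body) for one part.
        if cte == 'binary':
            return 'binary', payload      # raw pass-through
        # simplified 8bit qualification test: no NUL, no overlong lines
        clean = b'\x00' not in payload and all(
            len(line) <= 998 for line in payload.split(b'\r\n'))
        if cte == '8bit' and clean:
            return '8bit', payload
        if prefer == 'quoted-printable':
            return 'quoted-printable', quopri.encodestring(payload)
        return 'base64', base64.encodebytes(payload)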


N.B. the documentation for the 2.6.3 section 19.1.3 MIMEText class (reproduced below) is confusing:

class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

   Module: email.mime.text

   A subclass of MIMENonMultipart, the MIMEText class is used to create
   MIME objects of major type text. _text is the string for the payload.
   _subtype is the minor type and defaults to plain. _charset is the
   character set of the text and is passed as a parameter to the
   MIMENonMultipart constructor; it defaults to us-ascii. No guessing or
   encoding is performed on the text data.

   Changed in version 2.4: The previously deprecated _encoding argument
   has been removed. Encoding happens implicitly based on the _charset
   argument.


The confusion is that it first states that no encoding is performed, and then states that encoding happens implicitly. It is not clear what it actually does, if anything. The 3.2a0 documentation further muddies the waters by removing the last paragraph.
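For what it's worth, the implicit behavior can be observed directly (output shown is from a current Python 3 interpreter):

    from email.mime.text import MIMEText

    ascii_msg = MIMEText('hello', 'plain', 'us-ascii')
    utf8_msg = MIMEText('héllo', 'plain', 'utf-8')
    # the _charset argument drives an implicit transfer-encoding choice
    print(ascii_msg['Content-Transfer-Encoding'])  # 7bit
    print(utf8_msg['Content-Transfer-Encoding'])   # base64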


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
