On approximately 4/16/2009 6:02 AM, came the following characters from the keyboard of Steven D'Aprano:
> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>> at all, so there would be no loss of consistency.
>
> That's ... different.
>> If the data for a header field is not properly a string,
> But it always is. Even badly formatted emails with corrupt headers containing binary characters are strings -- they're just byte (non-Unicode) strings containing binary characters. Your mail server might not accept it as part of a valid header, but it's a valid byte string.

Wire format email headers are composed of a subset of ASCII text. There should be a way to obtain them, either as bytes, or via the trivial str conversion of those bytes to Unicode. Even corrupt headers containing binary characters should be obtainable that way. There are no header encoding or decoding algorithms that cannot be reworked to function properly on either the raw_bytes or raw_str version of a header, since the numeric values and sequence of all binary octets would be preserved via both raw_bytes and raw_str. *The key is to know what is in hand.* For both raw_bytes and raw_str, all characters would be in the range 0 - 0xFF. This is simple transliteration, not interpretation or parsing. A non-corrupt header would have a smaller range, 0x20 - 0x7F. Any header should be obtainable or settable in this form, using either bytes or str parameters/results. Yes, it should be possible to create corrupt headers in this manner. Useful mostly for testing, or for idempotency (which I also call GIGO).
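To illustrate what I mean by "simple transliteration", here is a minimal sketch. It assumes the Latin-1 codec is used as the one-to-one mapping between octets 0 - 0xFF and code points U+0000 - U+00FF; the function names are hypothetical and not part of the email package.

```python
def raw_str_from_bytes(raw: bytes) -> str:
    # Latin-1 maps each octet 0x00-0xFF to the code point with the same
    # numeric value, so nothing is interpreted and nothing can fail.
    return raw.decode("latin-1")


def raw_bytes_from_str(raw: str) -> bytes:
    # The inverse mapping. It raises UnicodeEncodeError if the string holds
    # a code point above U+00FF, i.e. if it is not really a raw_str.
    return raw.encode("latin-1")


# A deliberately corrupt header value containing binary octets.
wire = b"Subject: caf\xe9 \x00\xff"
as_str = raw_str_from_bytes(wire)

# Numeric values and sequence are preserved in both directions.
assert [ord(c) for c in as_str] == list(wire)
assert raw_bytes_from_str(as_str) == wire
```

Either form carries the same information, so any header algorithm could in principle be written against one or the other.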

However, obtaining headers in that way should be "hard", but only in the sense of having to type more because it is part of a lower level interface, not the primary APIs... like msg['tag'].raw_bytes or msg['tag'].raw_str... because it is actually the easiest way (implementation-wise) to obtain a copy of the data... but that copy may not be as useful as one might like.

str(msg['tag']) or msg['tag'].str (or some such spelling[s]) should always produce a displayable form of the header. If it is a known, standardized header that may contain data that was encoded for transmission, such encodings should be reversed, and Unicode characters outside the range of U+0020 - U+007F may be included. Remember the goal here is "displayable". So if the encoding is bad for a standard header, or a standard header is corrupt, or a non-standard header contains what is apparently binary gibberish, and non-displayable Unicode control characters are generated, they should be escaped as 7 ASCII characters representing a Unicode code point "\U+0017". All such display strings must always have "\" converted to "\\" so that there is no ambiguity when interpreting strings that may contain text that looks like one of the escape strings.
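A rough sketch of that escaping rule follows. The function name displayable is hypothetical, and treating every character in a Unicode "C" (control/format/unassigned) category as non-displayable is my assumption about where to draw the line. Code points above U+FFFF would need more than four hex digits, so the "7 characters" only holds within the BMP.

```python
import unicodedata


def displayable(text: str) -> str:
    """Return an unambiguous, displayable form of an already-decoded header."""
    out = []
    for ch in text:
        if ch == "\\":
            # Double every backslash first, so literal text that merely looks
            # like an escape can always be told apart from a real escape.
            out.append("\\\\")
        elif unicodedata.category(ch).startswith("C"):
            # Assumed definition of "non-displayable": general categories
            # Cc, Cf, Cs, Co and Cn.
            out.append("\\U+{:04X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


print(displayable("bad\x17byte"))   # control character is escaped
print(displayable("\\U+0017"))      # literal backslash is doubled
```

With the backslash doubled, a reader of the display string can distinguish a genuine control character from header text that happened to contain the characters of an escape sequence.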

Known standard headers should have additional APIs (these already exist for the most useful ones) to obtain the interesting subcomponents (encodings, names, addresses, MIME types, etc.). These should have str parameters and results interfaces only, and specification of an encoding can be optional, defaulting to UTF-8 (or possibly defaulting to a Message-level encoding specification, which in turn may default to UTF-8), overridable in some of the APIs via optional parameters (some, because overloaded assignment APIs may not have room for such overrides, not having optional parameters).
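For comparison, the existing standard library already offers str-in, str-out helpers for some of these subcomponents. The sketch below uses only email.utils.parseaddr and email.header.decode_header / make_header as they exist today; it is not the API proposed above, and the sample address is made up.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

# A From-style value whose display name is an RFC 2047 encoded word.
raw = "=?utf-8?q?Caf=C3=A9_Owner?= <owner@example.com>"

# Split into (display name, address); both come back as str.
name, addr = parseaddr(raw)

# Reverse the transmission encoding of the display name.
decoded_name = str(make_header(decode_header(name)))

print(decoded_name)
print(addr)
```

The caller never touches bytes here: the transmission encoding is reversed inside the helpers, which is the shape I would like the subcomponent APIs to keep.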

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
