Re: [Email-SIG] API for Header objects [was: Dropping bytes "support" in json]

R. David Murray Thu, 16 Apr 2009 12:42:35 -0700

On Thu, 16 Apr 2009 at 14:08, Tony Nelson wrote:

At 23:02 +1000 04/16/2009, Steven D'Aprano wrote:

On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:

I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
at all, so there would be no loss of consistency.


That's ... different.


Indeed.

Messages need
flattening to bytes, but there is no use for converting individual
header fields into bytes or strings, outside of a message.


Of course there is. You create each header individually, so you should
be able to extract each header individually. Here, for example, is a
use-case: I want to send postmaster a copy of the X-Spam-Evidence
header so she can see why a particular piece of ham got wrongly flagged
as spam, or vice versa:

X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
 'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
 0.05; ...

I need to be able to extract just that one header, and while some
applications (mail client?) may choose to give me the entire message as
text and expect me to manually hunt for the relevant line and
copy-and-paste it, other applications may wish to automatically extract
the appropriate header and email it to postmas...@localhost. Or write
it to a log file, or whatever. Whatever they do, they probably need it
as a string (of characters or bytes), not a binary blob.


This example seems tortured and contrived.  Custom code to extract a single
header one time to send to someone?  Just hit "reply" and trim it yourself.
If you must, you can use .get_header('X-Spam-Evidence').flatten().  I doubt
that anyone would actually do that, outside of a debugging session.

Any automatic process for sending reflected spam should include more of the
message, using the relevant MIME type message/partial (or message/rfc822).


Have you written a user interface using the email package?  I have.
In that user interface, I most definitely want to turn individual headers
into strings.  Specifically, this is a usenet news reader, and when
presenting messages I want to display _only_ the Date and From headers.
You will note that 'From' is an address header, and in this particular
use case I want to use "str(message['From'])", and I don't care two
hoots that the thing is properly a list of friendly-name address pairs.

That is not a contrived example, that's _production code_ that I
use every day.
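
For illustration, a minimal sketch of that display use case. It assumes
the email API in the current Python 3 standard library (policy.default),
not the API being designed in this thread, and the sample article and
its addresses are made up:

```python
import email
from email import policy

# Made-up sample article; the names and addresses are placeholders.
RAW = (
    "From: John Smith <jsmith@example.com>\n"
    "Date: Thu, 16 Apr 2009 12:42:35 -0700\n"
    "Subject: API for Header objects\n"
    "X-Trace: not shown to the reader\n"
    "\n"
    "Body text.\n"
)


def summary_lines(raw_text, wanted=("Date", "From")):
    """Return 'Name: value' display lines for the wanted headers only."""
    msg = email.message_from_string(raw_text, policy=policy.default)
    lines = []
    for name in wanted:
        value = msg[name]  # None if the header is absent
        if value is not None:
            # str() gives the human-readable form of the header body,
            # even though 'From' is structurally a list of addresses.
            lines.append(f"{name}: {str(value)}")
    return lines


for line in summary_lines(RAW):
    print(line)
```

The UI code never looks at the address structure here; it only asks each
header for its string form.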

Nor is the quoted example all that contrived: after reading it I was
considering whether it would be useful to run a program over my incoming mail
to extract the X-Spam-Evidence headers and a couple other headers and
email them to me in a report daily.  It's not useful enough that I'll
write the code (I've too many other priorities), but it's potentially
useful enough (for tuning my spam filters) that I don't consider it a
contrived use case.  And if the spam gets worse I may just come back
to that idea.
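
A rough sketch of what such a report script might look like, again
assuming the current Python 3 standard library email API rather than
anything proposed in this thread. The function name report_lines and the
sample message are hypothetical:

```python
import email
from email import policy

# Made-up message; the folded header mirrors the example quoted above.
RAW = (
    b"From: someone@example.com\r\n"
    b"Subject: test\r\n"
    b"X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;\r\n"
    b" 'attribute': 0.04; 'objects': 0.04\r\n"
    b"\r\n"
    b"Body.\r\n"
)


def report_lines(raw_messages, names=("Subject", "X-Spam-Evidence")):
    """Collect the named headers from each message as one-line strings."""
    lines = []
    for raw in raw_messages:
        msg = email.message_from_bytes(raw, policy=policy.default)
        for name in names:
            for value in msg.get_all(name, []):
                # Collapse folding whitespace so each header is one line.
                lines.append(f"{name}: {' '.join(str(value).split())}")
    return lines


print("\n".join(report_lines([RAW])))
```

The messages come in as bytes, but the report itself is built entirely
from the string form of each header.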

Some
header field data /is/ strings, some is lists of address pairs, and
so on.


But "lists of address pairs" themselves are strings.


Wrong!  They are *lists* (or at least sequences) of address pairs of
friendly name, email address.  Just as bytes are not strings, and dicts are
not strings, and JPEG images are not strings, lists are not strings.  For better
understanding of what an Address is, see RFC 5322 (the current incarnation
of RFC x822), section 3.4, which describes both the best way and current or
obsolete practice.


I suspect that most or all of us do understand the RFC.

When Steve says 'but lists of address pairs are themselves strings' I hear
him saying that each element of the pair is a string.  I think you would
have to agree with that.  Unless you want them to remain as byte strings?
Or, as I would prefer, make them into Address objects with appropriate
methods and an appropriate str.  But even then, the friendly name and
address data elements of the Address should be unicode strings.
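
To make the two readings concrete, here is a small sketch using
email.utils from the standard library. The addresses are placeholders.
The parsed value is a list of pairs, each element of a pair is a string,
and the whole list still has a string rendering:

```python
from email.utils import formataddr, getaddresses

# Placeholder header body with two addresses.
raw_value = "John Smith <jsmith@example.com>, Jane Doe <jdoe@example.org>"

# Structured view: a list of (friendly_name, address) tuples.
pairs = getaddresses([raw_value])

# String view: format each pair again and join for display.
display = ", ".join(formataddr(pair) for pair in pairs)

for name, addr in pairs:
    print(repr(name), repr(addr))
print(display)
```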

If the data for a header field is not properly a string,


But it always is.


No.  This is important, and you will not understand RFC x822 email until
you understand this:  email messages are not character strings.  They are
byte sequences.  This confusion pervades the email package only because in
Python before 3.x, bytes were represented as strings.


A header always has a string representation, though.  It's the one a
dumb-text UI would present to the user.  IMO the email package needs to
support building such UIs.  The string representation is also useful
for debugging (as is the bytes representation).  I see no reason
it should not be accessible through the normal Python 'str' method.
Why obfuscate access to it?

Even badly formatted emails with corrupt headers containing binary
characters are strings -- they're just byte (non-Unicode) strings
containing binary characters. Your mail server might not accept it as
part of a valid header, but it's a valid byte string.


Strings are not bytes.  Sequences of bytes are not strings.  Converting
between them demands an encoding.  Sometimes the encoding exists, sometimes
it mostly exists, and sometimes there is no such encoding, as for a JPEG
image, which is a structured byte sequence.


I agree with you that Unicode strings are not bytes, and that email is
encoded as (ASCII) bytes.

As for the JPEG, sure there's no encoding in the Unicode sense.  There
certainly is an encoding, though: JPEG wrapped up in the appropriate
mime type encoding.
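
A short sketch of the distinction, in plain Python 3. The four bytes are
the usual start of a JPEG/JFIF file: they have no decoding as text, but
base64 (one of the MIME content transfer encodings) turns them into
ASCII that can travel inside a message:

```python
import base64

# An ASCII header line converts to text once an encoding is named.
header_bytes = b"Subject: hello"
header_text = header_bytes.decode("ascii")

# First bytes of a typical JPEG/JFIF file: not valid UTF-8 text.
jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0])
try:
    jpeg_start.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

# The same bytes wrapped in a transfer encoding are pure ASCII.
wire_form = base64.b64encode(jpeg_start)

print(header_text, decodes_as_text, wire_form)
```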

a means to get it as one is wrong.


IMO it is always appropriate to be able to get a header body as a string.
It may not be a meaningful format in which to _manipulate_ the header
body information (which is why I think message's __getitem__ needs
to return a Header object), but it is a legitimate representation for
user consumption.

Email *is* text. It's built on top of a restricted range of ASCII bytes,
which we can legitimately call "text" because it is a subset of Unicode
text. Even if a particular header contains binary data, it must be
encoded as ASCII text before it can be placed into the header.

...

No, email is not text.  Email message bodies and some header fields may
represent text.  An email message is a byte sequence.  One really needs to
understand this in order to work with email at a low level.  When one does
not understand, then the email package should lead the user in the right
direction.


You and Steve are defining terms differently here, I think, but other
than that I suspect you are not that far apart on this particular point.

What I want the email package to do is make it easy to pass text in
and have the email package create the syntactically correct bytes
representation to go out on the wire.  I'm visualizing building the
'From' header, for example, something like this:

    message['From'] = AddressHeader(Address('John Smith', 'j...@foo.com'))

and have it default to UTF-8 encoding... or maybe the encoding gets
specified when I say message.serialize('utf-8').  But as I said, I
haven't actually written code that builds messages yet.
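
As a point of comparison only, here is how the same idea can be written
with the email API in the current Python 3 standard library. This is not
the AddressHeader or serialize() spelling imagined above, and the name
and address are placeholders. Text goes in, and the package produces
ASCII-safe bytes for the wire:

```python
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
# Placeholder sender with a non-ASCII friendly name.
msg["From"] = Address("Jörg Schmidt", "jschmidt", "example.com")

# Text view of the header body, for display.
text_form = str(msg["From"])

# Wire view of the whole message: the non-ASCII name is emitted as an
# RFC 2047 encoded word, so the result contains only ASCII bytes.
wire_form = msg.as_bytes()

print(text_form)
print(wire_form)
```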

Note that while I want to be able to do str(someHeader) to get a
string representation of a header body, I'm not so enamored of being
able to do

    message['From'] = 'John Smith <j...@foo.com>'

and have it get turned into a Header or AddressHeader object.
Frankly, that looks too magical to me.

--David