from:"Stephen J. Turnbull"

[Email-SIG] Some parsing/generation issues of email in Python 3

2016-06-08 Thread Stephen J. Turnbull

Hans-Peter Jansen writes:
 > Dear audience,
 > 
 > when coming back to this list, I couldn't believe my eyes because
 > of the low volume level, but after rechecking with the archives, I
 > have to accept, it is that quiet here, a bit too quiet from my
 > POV. Hmm.

It's just that very few people (one or two) are working on the module
and in my experience it has been rock-solid compared to either Python
2.7 email or the package distributed with Mailman 2.1.  I doubt very
many people are using Python 3 email on high-volume mailstreams yet,
as the high-performance networking (eg, Twisted) and perhaps some
other libraries were late to be ported.

 > I was quite astonished to find out, that this procedure isn't
 > working that well anymore: the email module appears way more
 > sensible in the current state.  This is a bit disappointing, as
 > reading the docs conveys, that some effort was put into reliability
 > and robustness. Given the much improved unicode handling of Python
 > 3 itself and the ever improving experience in handling emails, this
 > is contrary to my expectations, I have to confess.

It's a complete rewrite from first principles.  It's more robust in
principle and more maintainable in practice, but faced with 100s of
millions of emails (aka "tsunami of sewage"), the robustness can't be
guaranteed.  I'm willing to bet it will converge to "robust in
practice" much faster than the previous design did.

 > Minutes after switching to the new code, I stumbled across a traceback in 
 > msg.get_all('to') from a header like this:
 > 
 > To: unlisted-recipients: ;,
 > ""@pop.kundenserver.de (no To-header on input)
 > 
 > Hmm, not nice. http://bugs.python.org/issue27257

The header arguably fails to conform to RFC 5321, though it's
syntactically permissible in RFC 5322.  (See my comment on the issue.)

 > All these issues were harvested in less than halve an hour. What
 > really troubles me is the quietness around here in the light of
 > this experience. Doesn't people use Python (3) yet/anymore for
 > these kind of tasks?

Probably not.

 > Does somebody care?

email 5 for Python 3 is a complete rewrite from first principles.
Yes, somebody cared.

 > Am I missing something?

Patience and understanding of how open source software development
works, perhaps.

___
Email-SIG mailing list
Email-SIG@python.org
Your options: 
https://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com

Re: [Email-SIG] API for email threading library?

2012-01-09 Thread Stephen J. Turnbull

Bill Janssen writes:

  I think I'll finesse this issue with another (appropriate) layer of
  indirection.

OK by me (can't bring myself to +1 on a thoughtful finesse. :)

   In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm
   working on, I just use symbols named by the message IDs themselves;
  
  Yes, that works well for a static persistent representation.
  
  Lisp message threading?  What's that in aid of, if you can say?

The VM MUA for Emacs and XEmacs.

  RFC 5256 mentions it, but I had to go back to 2822 to figure it out.

Tee-hee-hee!  The wild, wonderful world of RFCs: You are in a twisty
maze of ABNF, all alike.

Re: [Email-SIG] header folding

2011-07-28 Thread Stephen J. Turnbull

Glenn Linderman writes:

  To me, wrap means to divide and join as necessary a set of lines 
  (sometimes/often a paragraph) to achieve some number of similar length 
  lines, not to exceed a line length limit, with possibly a shorter one at 
  the end.

Typically such usage is in contexts where a paragraph is represented
as a single physical line, though.  Your "set" is not part of "wrap"
in my dialect.

  I think that if these terms are defined in the RFCs, that those 
  definitions should be preferred to mine.

"Fold" is defined per RFC 5322.  The others don't seem to be.

I think "fold" should be used for the well-defined operation of header
folding (RFC 5322) and also for the well-defined operation of
inserting a soft linebreak in quoted-printable bodies (RFC 2045).
I'm happy with whatever usage others prefer for the other operations.

Re: [Email-SIG] header folding

2011-07-27 Thread Stephen J. Turnbull

R. David Murray writes:

  Hmm.  Makes sense to me.  So you'd rather the method were called fold
  and that refold_source remains the name of the policy control.

Yes.

  What's the word for what is done when a text message is made to have
  a line length of less than 78 by using quoted printable (or base64)
  encoding?

RFC 2045 discusses insertion of "soft line breaks"; it doesn't mention
a term like "folding".  "Folding" seems like a good term to me,
though.  Note that the RFC 2045 definition of quoted-printable says
that physical line length MUST be 76 characters or less, including any
terminating "=" but not the CRLF pair that separates lines.
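
A minimal sketch of the soft line breaks described above, using the
standard library's quopri module.  The 200-character sample line is made
up for illustration, and quopri emits bare LF line endings rather than
the CRLF pairs used on the wire, so treat this as a demonstration of the
76-character limit only.

```python
import quopri

# A made-up all-ASCII line of 200 characters, far over the 76-character limit.
original = b"x" * 200 + b"\n"

# Encoding inserts soft line breaks: "=" as the last character of a
# physical line, followed by a line ending.
encoded = quopri.encodestring(original)
physical_lines = encoded.split(b"\n")

# Longest physical line, counting any trailing "=" but not the line ending.
longest = max(len(line) for line in physical_lines)

# A conformant decoder removes the soft breaks again, so the recipient
# sees the original long line.
decoded = quopri.decodestring(encoded)

print(longest, decoded == original)
```

Because decoding restores the long line exactly, the break is invisible
to a conformant reader, which is why the same word as for headers seems
reasonable here.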

  Can anyone see a use case for controlling folding of headers
  separately from folding of message bodies?  I haven't thought of
  one, which is why I'm thinking one policy knob controls both.

The RFCs' treatments differ somewhat.  RFC 5322 has both a MUST NOT
and a SHOULD NOT exceed limit on line length (998 and 78 characters,
not including the CRLF, respectively).  RFC 2045 quoted-printable has
only the MUST NOT limit of 76 (but the difference in limits is not a
big deal).
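
To make the three limits concrete, here is a small sketch of a checker.
The function names and the returned labels are hypothetical, chosen only
for this illustration; the numbers are the ones quoted above from RFC
5322 and RFC 2045.

```python
HEADER_SHOULD_LIMIT = 78   # RFC 5322: SHOULD NOT exceed, CRLF excluded
HEADER_MUST_LIMIT = 998    # RFC 5322: MUST NOT exceed, CRLF excluded
QP_MUST_LIMIT = 76         # RFC 2045 quoted-printable, CRLF excluded


def classify_header_line(line: str) -> str:
    """Classify one physical header line against the RFC 5322 limits."""
    length = len(line.rstrip("\r\n"))
    if length > HEADER_MUST_LIMIT:
        return "violates MUST"
    if length > HEADER_SHOULD_LIMIT:
        return "exceeds SHOULD"
    return "ok"


def qp_line_ok(line: str) -> bool:
    """True if a quoted-printable physical line is within the 76 limit."""
    return len(line.rstrip("\r\n")) <= QP_MUST_LIMIT


print(classify_header_line("Subject: " + "a" * 80 + "\r\n"))
print(qp_line_ok("a" * 75 + "=\r\n"))
```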

It's not clear to me what exactly the policy knob you're talking about
is for body text.  There is no policy really allowed if quoted-
printable is being used.  So the policy knob is whether to use
quoted-printable to limit physical line length?

The only reason I can think of for having separate controls is that
many MUAs mishandle quoted-printable in the body text.  Patches don't
apply, one-time-key URLs in links get broken and fail to be
recognized.  On the other hand, header-folding rarely has such
consequences in my experience.


Re: [Email-SIG] header folding

2011-07-27 Thread Stephen J. Turnbull

R. David Murray writes:

  That's an interesting point.  So perhaps I should rename the control
  'header_source_refold'.

I don't have a strong opinion, but I tend to think it's
unnecessary.

  On the other hand, we could also provide a separate control
  for whether or not quoted printable bodies in particular were
  folded,

If the body is already known to be quoted-printable, you don't really
have a choice.  Folding lines longer than 76 characters after
quoted-printable encoding is required by RFC 2045.  Of course you can
do more folding than necessary (eg, fold an 85-character line at 35
and 70 characters), but that doesn't seem very useful to me.

It seems to me that the policy question (if it exists) is "We have an
all-ASCII body with 'long lines'.  Shall we encode in quoted-printable
only for the purpose of folding the long lines?"

Re: [Email-SIG] header folding

2011-07-27 Thread Stephen J. Turnbull

Barry Warsaw ba...@python.org writes:

  That's at least what I think of, and I do think we could
  have two knobs to control the different functionality:
  
  - To 'split' a line means to take a line longer than a specified
  maximum, and make it fit into the maximum line length, splitting at
  whitespace or other semantic separators.

In the case of headers, "folding" is hallowed usage (going back to at
least RFC 733), and is very precisely defined by RFC 5322.  If we are
going to do something non-RFC conformant (yeah, right, we might do
that, eh?), "splitting" would be better.  If our implementation is
intended to be conformant, I think "folding" is preferable both for
familiarity and ease of reference (look it up in RFC 5322).

I think the generalization to bodies is reasonable, although I haven't
found any RFC usage of "folding" in that context in a quick look.

  - To 'fill' a header means to take the logical contents of the
  header and recombine and resplit it so that each line is as close
  to the maximum line length as possible.  My analogy here is Emacs's
  M-q (fill-paragraph).

  What then is [...] wrapping?  Maybe no different than the above.

In my dialect, what you describe as "filling" is (at least
potentially) far more sophisticated than what I mean by "wrapping".
Wrapping moves forward through each line and at the maximum length
backtracks to the rightmost break point in the line, breaking there,
then continuing the process in the tail line.  This could and often in
my experience does result in very uneven lines.
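
The wrapping procedure just described can be sketched as follows.  This
is only an illustration under simple assumptions: the only break point
is a single space, and the function name greedy_wrap is made up here.

```python
def greedy_wrap(text: str, width: int) -> list:
    """Break at the rightmost space at or before `width`, then repeat
    the same process on the remaining tail."""
    lines = []
    while len(text) > width:
        # Backtrack from the limit to the rightmost break point.
        cut = text.rfind(" ", 0, width + 1)
        if cut <= 0:
            # No usable break point before the limit: take the next one.
            cut = text.find(" ", width)
            if cut == -1:
                break
        lines.append(text[:cut])
        text = text[cut + 1:]
    lines.append(text)
    return lines


# Made-up input that shows how uneven the result can be.
print(greedy_wrap("a bbbbbbbbb cc dd", 10))
```

No attempt is made to balance line lengths, which is where filling would
differ.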

However, I don't think we're talking about filling here.  Filling IMHO
should be implemented by the email module, but it should be called
explicitly by the client, not imposed internally on the basis of a
global policy.

Consider the following ugly header (which is somewhat unlikely to
actually appear in a real use case, although it could easily result
from cut-and-paste into an MUA's "To" field):

To: Amie Cawinski a...@abc.org, Ichabod
 Tallman i...@cow.org

(there is no trailing whitespace on either line).  IMO, there are two
plausible fillings (assuming a limit of 78 characters) here:

To: Amie Cawinski a...@abc.org, Ichabod Tallman i...@cow.org

and

To: Amie Cawinski a...@abc.org,
    Ichabod Tallman i...@cow.org

of which the second will be uglified by an RFC-5322-conformant
processor into:

To: Amie Cawinski a...@abc.org,    Ichabod Tallman i...@cow.org

(note the extra space after the comma).  I personally don't consider
either of

To: Amie Cawinski a...@abc.org,
 Ichabod Tallman i...@cow.org

To: Amie Cawinski a...@abc.org,
<TAB>Ichabod Tallman i...@cow.org

plausible as a presentation, but YMMV.  So filling (to me) is about
presentation, not protocol conformance.

Anyway, I don't see how we can justify making *these* choices for the
user on the basis of a policy that really is about conservative
compliance to a wire protocol standard.  For example, I personally do
not fill 81-character subject headers; it's just too ugly.  However,
I might want my mail program to conservatively fold them, especially
for certain correspondents known to be stuck behind weird MTAs or MUAs.

  You might have a message body that contains code, in which case you
  might want to fill the headers (using the terminology above), but
  not fill the body.

That's another example of why control for filling has to be flexible
(and why IMHO filling should be called explicitly by the client).

However, if the receiving MUA is RFC 2045-conformant, the user cannot
tell that quoted-printable folding was used.

[Email-SIG] header folding

2011-07-25 Thread Stephen J. Turnbull

R. David Murray writes:

  the end.  Basically, BaseHeader gets a 'wrap' method, and there is
  a new policy control, 'refold_source' (I'll probably rename it to
  'rewrap_source', since I expect to apply it also to message
  bodies).

This bothers me.  Folding and wrapping are two different things.

Folding is about invertibly reformatting a single logical line to make
machines happy during transmission; what wrapping "does" is not 100%
clear to me but it's about making people happy.  (I put "does" in
quotes because it's not obvious to me that the source of wrapped text
necessarily is a single anything, nor that wrapping need be
invertible.)

I grant that people and many MUAs take a different point of view about
header folding, but clearly the RFCs have moved away from placing any
importance on presentation aspects toward specifying an invertible
transformation exactly.  On the other hand, I think that wrapping
should place emphasis on presentation.
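
A minimal sketch of what "invertible" means for folding, assuming the
simplest reading of RFC 5322: a fold inserts CRLF immediately before
existing whitespace, and unfolding removes any CRLF that is immediately
followed by whitespace.  The fold() and unfold() helpers are hypothetical
and are not the email package's implementation.

```python
import re


def fold(line: str, limit: int = 78) -> str:
    """Insert CRLF before existing spaces so each physical line fits."""
    pieces = []
    while len(line) > limit:
        # Rightmost space at or before the limit, never at position 0.
        cut = line.rfind(" ", 1, limit + 1)
        if cut <= 0:
            break  # nowhere to fold; leave the long line alone
        pieces.append(line[:cut])
        line = line[cut:]  # the space stays as the continuation's leading WSP
    pieces.append(line)
    return "\r\n".join(pieces)


def unfold(text: str) -> str:
    """Remove every CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", text)


header = "Subject: " + " ".join(["word"] * 40)
folded = fold(header)
print(unfold(folded) == header)
```

Nothing is added or dropped except the CRLF itself, so unfold(fold(x))
returns x exactly; a wrapping operation makes no such promise.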

[Email-SIG] question on syntax of 'group' in address-list

2011-05-16 Thread Stephen J. Turnbull

R. David Murray writes:
  I've gone through the RFCs and done some additional googling,
  and haven't been able to confirm the answer to this question: what
  exactly is the syntax when a group is included in an address-list? (See
  http://tools.ietf.org/html/rfc5322#section-3.4).  The question is, if
  another address follows the group, are they separated from each other by
  ';' or by ';,'?  The ABNF seems to call for the latter, but I can't find
  any example showing it.  I'm sure that I should accept both on input,

Why?  I mean, YAGNI.

  but I'd like to generate the correct form.  Does anyone have confirmation
  or contradiction for my interpretation?

From RFC 822.  The Cc field contains two groups, separated by ",",
with each group terminated by ";".

A.3.3. About as complex as you're going to get

 
 Date     :  27 Aug 76 0932 PDT
 From     :  Ken Davis kda...@this-host.this-net
 Subject  :  Re: The Syntax in the RFC
 Sender   :  KSecy@Other-Host
 Reply-To :  Sam.Irving@Reg.Organization
 To       :  George Jones gr...@some-reg.an-Org,
             Al.Neuman@MAD.Publisher
 cc       :  Important folk:
               Tom Softwood ba...@tree.root,
               Sam Irving@Other-Host;,
             Standard Distribution:
               /main/davis/people/standard@Other-Host,
               Jonesstandard.dist.3@Tops-20-Host;
 Comment  :  Sam is away on business. He asked me to handle
             his mail for him.  He'll be able to provide  a
             more  accurate  explanation  when  he  returns
             next week.
 In-Reply-To: some.string@DBM.Group, George's message
 X-Special-action:  This is a sample of user-defined field-
             names.  There could also be a field-name
             Special-action, but its name might later be
             preempted
 Message-ID: 4231.629.XYzi-What@Other-Host
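
For illustration, the same ";," form can be produced and read back with
the Address and Group classes in email.headerregistry from later Python 3
releases (they did not exist when this was written).  The addresses
below are made up; only the group names come from the RFC 822 example.

```python
from email import policy
from email.headerregistry import Address, Group

# Made-up mailboxes; only the group names are taken from the example.
important = Group("Important folk", [
    Address("Tom Softwood", "tom", "tree.example"),
])
standard = Group("Standard Distribution", [
    Address("", "standard", "other-host.example"),
])

# Each group is terminated by ";" and list items are separated by ",",
# so a group followed by anything else yields ";,".
cc_value = ", ".join(str(group) for group in (important, standard))
print(cc_value)

# Parse the generated value back with the default policy's header factory.
parsed = policy.default.header_factory("Cc", cc_value)
group_names = [group.display_name for group in parsed.groups]
print(group_names)
```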


Re: [Email-SIG] [Python-Dev] email package status in 3.X

2010-06-17 Thread Stephen J. Turnbull

l...@rmi.net writes:

  FWIW, after rewriting Programming Python for 3.1, 3.x still feels
  a lot like a beta to me, almost 2 years after its release.

Email, of course, is a big wart.  But guess what?  Python 2's email
module doesn't actually work!  Sure, the program runs most of the
time, but every program that depends on email must acquire inches of
armorplate against all the things that can go wrong.  You simply can't
rely on it to DTRT except in a pre-MIME, pre-HTML, ASCII-only world.
Although they're often addressing general problems, these hacks are
*not* integrated back into the email module in most cases, but remain
app-specific voodoo.

If you live in Kansas, sure, you can concentrate on dodging tornados
and completely forget about Unicode and MIME and text/bogus content.
For the rest of the world, though, the problem is not Python 3.  It's
STD 11 (which still points at RFC 822, dated 1982!)  It's really
inappropriate to point at the email module, whose developers are
trying *not* to punt on conformance and robustness, when even the IETF
can only run in circles, scream and shout!

Maybe there are other problems with Python 3 that deserve to be
pointed at, but given the general scarcity of resources I think the
email module developers are working on the right things.  Unlike many
other modules, email really needs to be rewritten from the ground
(Python 3) up, because of the centrality of bytes/unicode confusion to
all email problems.  Python 3 completely changes the assumptions
there; a Python 2-style email module really can't work properly.

Then on top of that, today we know a lot more about handling issues
like text/html content and MIME in general than when the Python 2
email module was designed.  New problems have arisen over the period
of Python 3 development, like domain keys, which email doesn't
handle out of the box AFAIK, but email for Python 3 should IMHO.

Should Python 3 have been held back until email was fixed?  Dunno, but
I personally am very glad it was not; where I have a choice, I always
use Python 3 now, and have yet to run into a problem.  I expect that
to change if I can find the time to get involved in email and Mailman
3 development, of course. <wink>


Re: [Email-SIG] invertability and idempotence

2009-10-22 Thread Stephen J. Turnbull

Andrew McNamara writes:

  The discussion had referred to idempotency up until that point, and I
  didn't want to introduce new terminology. But referring to this:
  
  generate(parse(msg)) == msg
  
  as idempotency is perfectly valid in my opinion (as in, applying an
  operation multiple times produces the same result). 

That would be generate(generate(msg)) == generate(msg) or
parse(parse(email)) == parse(email).  The input and output of
these functions are of *different types*, so they cannot possibly be
idempotent.

I'm +1 on changing to use "invertible", -0 on continuing to use
"idempotent" (since it's the traditional idiom), and -1 on using
"idempotent" to mean "is deterministic", ie, generate(msg) ==
generate(msg).

If msg changes state in an irrelevant way, it would be nice to produce
the same output from generate.  But that is not idempotency.

And we would need to specify precisely what "irrelevant" means.  For
example, if a client of the Message class decides to specify the MIME
boundary explicitly, then the output of generate has to change IMO.
OTOH, many MIME implementations put the time of day or the generating
process into the MIME boundary.  This is unnecessary (boundaries need
to be unique only message-wide, and the email package can adjust the
boundary to not conflict with message content, eg, Emacs/Gnus uses
something like -=-=-=-=- by default), and I would hope that email
avoids such practices when possible.
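
A rough sketch of a boundary chooser that depends only on message
content, so that repeated calls on the same parts give the same output.
The function name and the numeric-suffix scheme are assumptions made for
this illustration, and the substring test is deliberately stricter than
MIME requires.

```python
def choose_boundary(parts, base="-=-=-=-=-"):
    """Return a boundary that occurs in none of the parts, with no
    dependence on the clock or the process ID."""
    boundary = base
    suffix = 0
    # Conservative check: reject the candidate if it appears anywhere.
    while any(boundary in part for part in parts):
        suffix += 1
        boundary = "{}{}".format(base, suffix)
    return boundary


plain = ["Hello, world.", "A second part."]
awkward = ["This part quotes -=-=-=-=- in its text.", "Another part."]
print(choose_boundary(plain))
print(choose_boundary(awkward))
```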


Re: [Email-SIG] invertability and idempotence

2009-10-22 Thread Stephen J. Turnbull

Andrew McNamara writes:

didn't want to introduce new terminology. But referring to this:

generate(parse(msg)) == msg

as idempotency is perfectly valid in my opinion (as in, applying an
operation multiple times produces the same result). 
  
  That would be generate(generate(msg)) == generate(msg) or
  parse(parse(email)) == parse(email).  The input and output of
  these functions are of *different types*, they cannot possibly be
  idempotent.
  
  You're splitting hairs - the operation generate(parse(X)) is
  idempotent, and that's what I was referring to.

Yes and no.  The equation above does imply idempotency, but it is a
much stronger statement: generate(parse()) is the identity.  That
stronger statement could be useful in practice, but it could also be
expensive to implement.  That tension could engender flamewars if the
requirement is expressed by the word idempotency but the intent is
identity.

For example, suppose that for MIME multipart messages, generate() uses
$%$%$%$%$%$ as the separator as long as no component contains that
string.  Then generate(parse(msg)) will be *equivalent* but not
*identical* to msg for most messages received from non-Python-email-
using MUAs.  generate(parse()) is idempotent, though.  I don't think
the folks who ask for idempotency would be satisfied with that!
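
The distinction can be shown with a toy parser and generator.  These are
stand-ins invented for this sketch, not the email package's functions:
parse() strips whitespace around header values and generate() always
writes "Name: value".

```python
def parse(text):
    """Toy parser: 'Name:value' lines become (name, value) pairs."""
    pairs = []
    for line in text.splitlines():
        name, _, value = line.partition(":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def generate(pairs):
    """Toy generator: always emits the canonical 'Name: value' form."""
    return "".join("{}: {}\n".format(name, value) for name, value in pairs)


def roundtrip(text):
    return generate(parse(text))


raw = "Subject:hello\nTo:   someone@example.com\n"
once = roundtrip(raw)
twice = roundtrip(once)

print(twice == once)   # idempotent: a second pass changes nothing
print(once == raw)     # but not the identity on non-canonical input
```

The composite is idempotent because its output is already canonical, yet
it is the identity only on input that was canonical to begin with.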

As I said earlier, if we're going to use the word "idempotent" to mean
"invertible", that's established practice, so we footnote the
Humpty-Dumpty-ism, and I can live with that.  But if we're going to
try to be more accurate, let's be fully accurate.

Re: [Email-SIG] fixing the current email module

2009-10-12 Thread Stephen J. Turnbull

Barry Warsaw writes:

  I would propose a radical suggestion: we treat backward compatibility  
  the way Python 3 did.  Nice to keep, but we can throw it over the side  
  in order to fix the warts.  We'll worry about migration strategy later.

+1

  Aside: I would really like to have a much more @property based API  
  where appropriate.

+1

  E.g. Message.get_content_type() would be Message.content_type.  And
  in this case we'd probably have message.payload_bytes or some such.
  Decoding may require additional parameters so it will probably be a
  method.

Maybe, but in general those parameters can be deduced from the
metadata.  If we can use those defaults often enough, then the
default-decoded version can be a property too.

We would have to provide alternatives, though.  I've seen Shift JIS
encoded Japanese labelled ISO-2022-JP, and apparently many Japanese
MUAs actually decode that to Japanese!  Not suggesting that we should
do the same, but probably the generic function that is used to decode
should be exposed as a method so that clients who encounter such
nonsense can deal with it, and override any of the metadata.
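
A sketch of how a property and a method could coexist, under the
assumption that a part carries its raw payload bytes and the charset
declared in its metadata.  The TextPart class and its attribute names
are hypothetical, not a proposed API.

```python
from typing import Optional


class TextPart:
    """Hypothetical message part: raw payload bytes plus declared charset."""

    def __init__(self, payload_bytes: bytes, declared_charset: str):
        self.payload_bytes = payload_bytes
        self.declared_charset = declared_charset

    def decode(self, charset: Optional[str] = None,
               errors: str = "strict") -> str:
        """Generic decoder; a caller may override the declared charset."""
        return self.payload_bytes.decode(charset or self.declared_charset,
                                         errors)

    @property
    def text(self) -> str:
        """Default-decoded payload, trusting the metadata."""
        return self.decode()


# Correctly labelled ISO-2022-JP: the property is enough.
honest = TextPart("日本語".encode("iso-2022-jp"), "iso-2022-jp")
print(honest.text)

# Shift JIS bytes labelled ISO-2022-JP: the client overrides the label.
mislabelled = TextPart("日本語".encode("shift_jis"), "iso-2022-jp")
recovered = mislabelled.decode(charset="shift_jis")
print(recovered)
```

The property covers the common case, and the method remains available
when the metadata cannot be trusted.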

Re: [Email-SIG] fixing the current email module

2009-10-11 Thread Stephen J. Turnbull

Glenn Linderman writes:

  conformant is not in the dictionaries I've consulted.

Try these (top 3 goggle results for conformant):

conformant- WordWeb dictionary definition
(computing) conforming to a particular specification or standard In
this paper we present a new approach to conformant planning. Nearest
...
www.wordwebonline.com/en/CONFORMANT - Cached - Similar - 
conformant - Definition from the Merriam-Webster Online Dictionary
conformant can be found at Merriam-WebsterUnabridged.com. Click here
to start your free trial! Click here to search for another word in the
Merriam-Webster ...
www.merriam-webster.com/dictionary/conformant - Cached - Similar - 
Conformance
The notion of TEI conformance is intended as an aid in describing the
format and contents of a particular document or set of documents. ...
www.tei-c.org/Guidelines/P4/html/CF.html - Cached - Similar - 

A quick look at some of the results shows that the word "conformant" is
typically used in a section called "conformance", which defines what
criteria are used to determine if an application is following the
standard or not.  OTOH, the fact that the top three results are
dictionary definitions suggests an awful lot of people are looking up
the word in dictionaries.

  Conforming is mostly a verb, not an adjective.

Goggling gives "Results 1 - 10 of about 3,680,000 for conforming
application", but "Results 1 - 10 of about 324,000 for conformant
application".  Looks like "conforming" is the preferred adjectival
form.

  but conformable and compliant are synonyms.

When used to mean submissive.  Conformable won't do.

  English is hard enough for ESL folks when they can find the 
  words in the dictionary.

"Compliant" does seem to be the winner.  "Results 1 - 10 of about
13,900,000 for compliant application".  "Conformant" or "conforming" is
better IMHO but much less popular.  Tie goes to the lusers, as usual.

Re: [Email-SIG] fixing the current email module

2009-10-10 Thread Stephen J. Turnbull

I'm running out of time to work on this (yeah, I know it's the
weekend, but my life is like that lately).  I think we're converging,
though, so I'd like to try and tie some of those ends together.

Glenn Linderman writes:
  On approximately 10/9/2009 8:10 AM, came the following characters from 
  the keyboard of Stephen J. Turnbull:

   Actually, I would say you are emitting leniently, in violation of the
   Postel principle.  
  
  You can say that, but I don't have to believe it.  I'm talking about 
  accepting; the message has arrived, it is here, the client is trying to 
  look at it, and I'm talking about ways the client can look at 
  not-quite-perfect data, knowing that it is not quite perfect, but still 
  being able to see it.  I'm not at all talking about emitting data.

It would be indeed, if the corrupt data is stored in the place where
correctly decoded data normally is stored, and is accessible in the
same way.  But I gather that's not what you were talking about, my
mistake.

  You seem to be calling the email package helping the client to
  accept not-quite-perfect data, as a form of emitting data.  It is
  not.

No, I was confused by the way you wrote.  Saving the data *somewhere*
is absolutely necessary; not losing data is the #1 commandment of
low-level mail processing.  Surely the email module is subject to that
commandment.  *Nobody* is talking about losing any data yet, except
Barry indirectly when he says that some people are thinking of giving
up on invertibility (often called "idempotency"), and even he is quite
adamant that he's not going to give up on that.

So when you wrote about saving and converting to text form, without
mentioning the specific APIs, I assumed you meant the mainline
APIs for parsing and accessing parts of a correctly formatted message.

  The email package cannot police the client... if it chooses to eat it 
  in a single gulp without looking at it then it may get indigestion.  I 
  never suggested that converting to Unicode as if it were Latin-1 
  should be done without informing the client, or being requested by the 
  client to do that via a special API call...

Well, maybe I misread it, but it certainly looked like that to me.  I
would not object to that special API call defaulting to ISO 8859/1.

  If you ignore defect reports, you are ignorant (blunt, but not intended 
  to be offensive).

What I worried about is that if defect reports are present, *but
displayable data is also present*, programmers *will* simply display
it, for example in producing a prototype program.  It will be
impossible to determine without very close analysis of that program
that an early version became a production version without adding
appropriate checks.  In practice, this bug will be discovered when
some end user's installation breaks.

It seems that you agree with this, and because the special API call is
necessary, it will be easy to identify whether proper care is being
taken or not.  Right?

 It is still raw user input, and should still be checked for proper 
 syntax by the client,
  
   Nonsense.  The email module had better know a lot more about syntax
   than the client.  If it doesn't, whack it with a 2x4 until it learns!
  
  I think we are talking at cross purposes here.  I find it quite 
  difficult to follow where you cross the boundary between talking about 
  one sort of email package client, and then switch to another type, or 
  switch to the responsibilities of the email package.

Excuse me?  The raw user input you referred to above is material
that the client software receives from the email package.  The email
package should give it to the client in the normal (convenient) way
only if it can certify that it conforms to the appropriate standard.

That standard should be specified in the API documentation.  Any more
detailed structure, of course, is the responsibility of the client.

  An application which is using email as a transport, has specific goals, 
  which require specific content.  You were mentioning clients.

I've already said that when I speak of an MUA, I write "MUA".  In
speaking of the calling program, which might even be a user running
the module via the Python interpreter, I write "client".  It's a very
convenient way to describe the user of an API, in contrast to the
provider of the API (the implementation).

  If such a client doesn't validate the syntax of that content, it
  isn't much of an application.

If that MUA or email application uses RFC 822 addresses, it should be
able to rely on the email module to parse those addresses correctly,
or provide a defect report.  One might even go so far as to suggest
that it be able to parse the (non-RFC, but very common) "+" notation
for separating the mailbox from additional data used for VERP and
challenge-response applications.  That would have to be documented,
but if so documented client applications like the MUA should be able
to rely on it (and you can bet many will).
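
As an illustration of the "+" notation, here is a sketch that splits an
addr-spec into mailbox, detail and domain.  It assumes an unquoted local
part and treats the first "+" as the separator; the helper name and the
sample addresses are made up.

```python
def split_plus_address(addr_spec: str):
    """Split 'mailbox+detail@domain' into (mailbox, detail, domain)."""
    local, at, domain = addr_spec.rpartition("@")
    if not at:
        raise ValueError("no '@' in address: {!r}".format(addr_spec))
    # Everything after the first "+" is the additional data.
    mailbox, _, detail = local.partition("+")
    return mailbox, detail, domain


# A VERP-style bounce address carrying the subscriber as extra data.
print(split_plus_address("list-bounces+user=example.org@lists.example.com"))
print(split_plus_address("user@example.com"))
```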

Application domain syntax

Re: [Email-SIG] fixing the current email module

2009-10-10 Thread Stephen J. Turnbull

Glenn Linderman writes:
  On approximately 10/9/2009 3:08 PM, came the following characters from 
  the keyboard of Tokio Kikuchi:

   Your suggestions 1)-4) are not acceptable to Japanese users at
   all.

  If a message with an encoded header arrives (like your number 2 sample) 
  but it cannot be decoded, what action _is_ acceptable to Japanese 
  users?  And what action is implemented in Mailman (if different)?

I know a fair bit about Japanese (both the language and the users),
and I'm having difficulty understanding what Tokio means, given your
list of hypotheses.  I suspect he's basically rejecting the hypothesis
that it can't be decoded -- if it can't be decoded, then learn how to
do so!

  I can think of a 5th technique... don't modify the header, and send
  it through unchanged.  Now I think I've covered the gamut of
  possibilities,

I agree.  However, I think we're way out of bounds here.  We already
know how to decode anything that RFC 2047 can throw at us in charsets
that Python can handle.  Anything that can't be decoded then is
seriously malformed from the point of view of the mailing list users.
So why are we discussing this?  We don't even know what our mainline
APIs are going to look like, why are we discussing forcibly operating
on broken input?

[[ Aside:

  with an appropriate translation for Re: ).

Re is a Latin abbreviation; there is no appropriate translation. ;-)
]]

  MUAs or mailing list handlers that attempt to retain what was sent
  (idempotency or invertibility), would be more likely to do what I
  describe, and are more robust when faced with new character sets
  that they don't understand how to decode.

Maybe they are, but the email module doesn't know or care about what
they do.  Let's stick within what the email module is supposed to
handle.


Re: [Email-SIG] fixing the current email module

2009-10-10 Thread Stephen J. Turnbull

R. David Murray writes:
  I have set up two more documents on the wiki.  One is UseCases[1], [...].
  The other is a Glossary[2].

Thank you, very much!

  I think most of it accurately reflects the consensus here, but in
  it I'm proposing to use the term 'transfer-decoded' for #3, and
  'transfer-encoded' as an alternative to 'wire-format' just for
  symmetry.  Comments and suggestions welcome.

'Wire-format' means you can cat it to the wire, ie, RFC-conforming
(in fact, it's the only meaning in the RFCs by definition), and for
email itself it's always bytes AFAIK (Mama don' 'low no XML roun'
here, Lord, Lord!).  That's not true of all our applications, though,
especially stuff like doctests.  There are also some RFCs we use such
as BASE64 (specifically relevant to transfer encodings) that are
defined in terms of characters, not bytes, so 'transfer-encoded' is
slightly different from 'wire-format'.
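
A small sketch of that difference in Python 3 terms, using the standard
base64 module.  The variable names mirror the glossary terms and are
otherwise arbitrary.

```python
import base64

payload = "naïve café".encode("utf-8")          # arbitrary octets

# BASE64 is defined over characters: the result can be held as str,
# which is convenient in doctests and similar applications.
transfer_encoded = base64.b64encode(payload).decode("ascii")

# Only once those characters are serialized as ASCII octets do we have
# something that can be written to the wire.
wire_format = transfer_encoded.encode("ascii")

print(type(transfer_encoded).__name__, type(wire_format).__name__)
print(base64.b64decode(wire_format) == payload)
```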

I think in general that kind of comment should be applied directly to
the Glossary, but what deserves general discussion is how pedantic do
we want to be?  I think the distinction made here between 'wire-format'
and 'transfer-encoded' is useful *to us*, and in general lean toward
high pedantry (cf how much smoke and how little fire Glenn and I are
generating!)  WDOT?

Re: [Email-SIG] fixing the current email module

2009-10-10 Thread Stephen J. Turnbull

Glenn Linderman writes:
  (I switched conformant to compliant,

Conformant is in common use.  You might be more comfortable with
conforming.

Richard Stallman points out that you comply with the law, but you
conform to a standard.  I think it's useful to make that semantic
distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.

Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Oleg Broytman writes:
  On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote:
   In my opinion, the email module should never raise an exception as a
   result of working with a malformed message. Though it should
   certainly make the information that a message was malformed available
   for the calling program to check.
  
 I disagree. email package is not a user agent, and exceptions are *the*
  way to indicate there are problems.

Although practicality beats purity.

The email package has access to the wire format, and knows what to do
with most of it.  It should DTRT where that is possible, and punt
where not.  By "punt" I mean return a special object containing as
much of the meta data for an object as it could recover, along with
the data itself as a blob.

I would suggest that module utilities that require access to the
parsed form of data be designed as object methods.  The special
objects produced when broken wire format is encountered wouldn't have
those methods, and thus they'd fail the duck type test.  But that
makes sense: that duck can't quack anyway.

So this gives our (== Matt and me) desideratum that email never raises
(it's the Python runtime that will raise AttributeError), and also
Oleg's (in part, anyway): an exception *will* be raised.

I think (== hope) that this will sufficiently localize the issues that
even though only AttributeError would ever be raised, it will be
obvious what went wrong.
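
A minimal sketch of the "punt" idea, for illustration only. The class names `ParsedPart` and `UnparsedBlob`, the `reason` attribute, and the `describe` helper are all hypothetical; nothing like them is claimed to exist in the email package. The point is only that the parser returns an object either way, and the Python runtime raises AttributeError when a client asks the blob for something it cannot provide.

```python
class ParsedPart:
    """A MIME part the (hypothetical) parser fully understood."""

    def __init__(self, headers, body):
        self.headers = headers
        self.body = body

    def get_body(self):
        return self.body


class UnparsedBlob:
    """A part the parser punted on: recovered metadata plus the raw data.

    It deliberately has no get_body(), so it fails the duck-type test.
    """

    def __init__(self, headers, raw, reason):
        self.headers = headers   # whatever metadata could be recovered
        self.raw = raw           # the original octets, untouched
        self.reason = reason     # hypothetical: why parsing stopped


def describe(part):
    """Client code: use the parsed form if present, else inspect the blob."""
    try:
        return "parsed: " + part.get_body()
    except AttributeError:
        return "unparsed ({}): {} raw bytes".format(part.reason, len(part.raw))


good = ParsedPart({"Content-Type": "text/plain"}, "hello")
bad = UnparsedBlob({"Content-Type": "text/html"}, b"<p>oops", "missing boundary")
print(describe(good))
print(describe(bad))
```

Because the blob keeps its own metadata, the client that catches AttributeError can still find out what the parser recovered and why it gave up.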

 Then the calling program must catch all exceptions

That is just unreasonable.  There are too many ways for things to go
wrong.  If you have just one exception for all problems, it's easy to
catch them all, but then the client doesn't know what went wrong, and
has to partially parse the unparsable itself.  That's nuts; the reason
for using the email module is to delegate that in the first place, and
besides, to the extent it's possible, the module has presumably done
that.

OTOH, a long list of precise exceptions is both a maintenance burden
on the email module and on client programmers.

 Yes, if email parses a message in some way, OK. You can help by creating
  more intelligent parser(s). But if a parser stumbles upon an unparseable
  block, it must raise an exception.

No, that's the last thing you want it to do.  Suppose you have

Content-Type: multipart/alternative

Content-Type: text/plain

Content-Type: text/html; body-parseable=no

Clearly you want (a) a vanilla email client to just grab the
text/plain part, and (b) a client written by somebody whose boss uses
BustedMUA[tm] to be able to try to parse the text/html part, using the
special rules that apply to the jumble produced by BustedMUA.
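
Case (a) can be sketched with the email package as it exists in the standard library today. The sample message, the boundary string, and the helper name `first_plain_text` are made up for illustration; only `message_from_string`, `walk()`, `get_content_type()` and `get_payload()` are real calls.

```python
from email import message_from_string

# A made-up multipart/alternative message; the HTML part stands in for
# whatever a broken MUA might have produced.
RAW = (
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/alternative; boundary="XX"\r\n'
    "\r\n"
    "--XX\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "plain version\r\n"
    "--XX\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<p>html version, possibly mangled</p>\r\n"
    "--XX--\r\n"
)


def first_plain_text(msg):
    """Return the payload of the first text/plain part, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            return part.get_payload()
    return None


msg = message_from_string(RAW)
plain = first_plain_text(msg)
print(plain)
```

The vanilla client never touches the text/html part, so whether that part is parseable does not matter to it.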

In other cases, you might be able to find a valid part terminator, but
the header of that part was hosed.  So the whole part becomes a blob,
but the parser should resync at that point, and start parsing
following parts.

I can think of no input for which the parser should *ever* throw an
exception.  Utilities that depend on a particular object's parsed form
might have to do so, but even then it should be avoided if at all
possible.


Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Glenn Linderman writes:

 If conversions are avoided, then octets are unlikely to be out of 
 range?
  
   Haven't looked in your spam bucket recently, I guess.  Spammers
   regularly put 8 bit characters into headers (and into bodies in
   messages without a Content-Type header), for one thing.
  
  I'm aware of that, but if conversions are not done, octets are unlikely 
  to be _reported_ to be out of range

Conversions will eventually be done.  Best it were done quickly.

   Most clients are simply not going to be prepared for the kind of
   crap I see in /var/mail/turnbull every day.
  
  Are you referring to most email clients, or most 
  Python-email-library-using clients?

Sorry.  When I mean "MUA" I try to say "MUA".  By "client", I'm
referring to the higher level logic that is going to be calling the
email module.

  Is it your point of view, then, that incorrectly formed email should be 
  mostly treated as SPAM?

Heavens no!  Not by the email module, anyway!  The email module should
not know about spam (but see Barry's "we're having spam for Launchpad"
post: if you're that good, anything goes!), except maybe at a very
high level.

  Your "hit me with your best shot" comment indicates that you want a
  failure code or exception when the data is bad, and then a way to
  retry accepting errors?

My current thinking is that the email module should return an object
representing a partial parse.  The way that you find out if it is
partial is to try to access some data that should be in the object.
If the parse succeeded, the accessor returns the data (which might be
empty).  If the parse did not succeed, you get an AttributeError.
(This is just a paraphrase of what I wrote in response to Oleg.)

Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Oleg Broytman writes:

   where not.  By "punt" I mean return a special object containing as
   much of the meta data for an object as it could recover, along with
   the data itself as a blob.
  
 The special object is an instance of an exception class ;)

It could be, but it will be returned with return, not raise. ;)

   I think (== hope) that this will sufficiently localize the issues
   that even though only AttributeError would ever be raised, it
   will be obvious what went wrong.
  
 Not exactly. One can see an AttributeError, but what was the
  cause? why a parser has created a broken object? AttributeError
  doesn't preserve information from parser.

Who said it wouldn't?  Granted, I didn't say it would, but in my

Content-Type: multipart/alternative
Content-Type: text/plain
Content-Type: text/html; parseable=no

example, I would expect the object returned to reflect that
structure.  In particular the object representing the second MIME part
would indeed possess a valid Header member.  I would also attach the
original data (which in the case of a missing separator might very
well overrun into other parts, etc), but it would *not* be accessible
via the usual methods (eg, definitely not from .flatten()).

So in fact it's not clear to me that you could ask for more
information than that.

   I can think of no input for which the parser should *ever* throw an
   exception.
  
 Are you saying that even a random garbage would be parsed to a Message
  of some kind? No headers, a single unparsed body?..

As long as it contains no NULs or high-bit-set octets, and is
separated into at least two parts, each less than 998 characters long,
by a CRLF, yes, I would definitely expect that an otherwise randomly
generated string would be parsed to a Message.

This Message should not be sendable because RFC 5322 requires the
presence of a From and a Date.  However, if you were implementing a
sendmail-compatible MTA or LDA, you might very well wish to accept
such a thing on stdin, parse it to a Message, and then default the
From and Date header fields appropriately, and add a Message-ID header
field.  I would, anyway, wouldn't you?
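
The sendmail-style use case could look roughly like this with today's standard library. The helper name `fill_required_fields`, the default address, and the `example.invalid` domain are assumptions made for the sketch; `formatdate` and `make_msgid` are existing `email.utils` functions.

```python
from email import message_from_string
from email.utils import formatdate, make_msgid


def fill_required_fields(msg, default_from):
    """Add From, Date and Message-ID only where the input lacked them."""
    if "From" not in msg:
        msg["From"] = default_from
    if "Date" not in msg:
        msg["Date"] = formatdate(localtime=True)
    if "Message-ID" not in msg:
        # A fixed domain avoids a hostname lookup in this example.
        msg["Message-ID"] = make_msgid(domain="example.invalid")
    return msg


# Input as it might arrive on stdin: parseable, but not yet sendable.
msg = message_from_string("Subject: test\n\nbody text\n")
fill_required_fields(msg, "local-user@example.invalid")
print(msg.as_string())
```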

Ah, yes, that's another use case, isn't it?!

Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Barry Warsaw writes:

  >>> from email import message_from_string
  >>> with open('/dev/urandom') as wire:
  ...   data = wire.read(1024)
  ...

# insert A

  >>> msg = message_from_string(data)
  >>> # number of headers
  ... len(msg)
  0
  >>> len(msg.get_payload())
  1024
  >>> msg.defects
  []
  
  This actually makes perfect sense.  A message with no headers and a  
  mass of 1024 bytes in its payload is RFC valid!

If you insert at A

 # clear the high bit of every octet read above
 data = ''.join(chr(ord(ch) & 127) for ch in data)
 # optional with reasonably high probability:
 data = data[0:512] + '\r\n' + data[512:1024]

or similar.  Otherwise not. ;-)
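
A Python 3 rendering of the same experiment, as a sketch. A seeded `random.Random` stands in for /dev/urandom so the run is repeatable; the seed value is arbitrary. The sketch does not strip NULs, so it only covers the high-bit and line-length conditions.

```python
import random
from email import message_from_string
from email.message import Message

# Fixed seed: a repeatable stand-in for reading 1024 octets of /dev/urandom.
rng = random.Random(2009)
raw = bytes(rng.getrandbits(8) for _ in range(1024))

# Clear the high bit so every octet is 7-bit, then treat it as text.
data = "".join(chr(octet & 127) for octet in raw)

# Split into two pieces so that no single line can exceed 998 characters.
data = data[:512] + "\r\n" + data[512:]

msg = message_from_string(data)
print(type(msg).__name__, len(msg), msg.defects)
```

Whatever headers and defects the parser reports for this particular garbage, the call returns a Message rather than raising.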

Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Barry Warsaw writes:
  On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

   Headers could possibly be a quadruple instead of a triple, with the  
   4th item being the wire format if received?

I think the whole input format (note, not necessarily wire!) should be
saved off on the top-level Message object (possibly in a file, per
Barry's comments about that).  Subobjects could then refer to
pieces of that as position ranges.

  I think not a quad.  I think other APIs should be used to extract the  
  raw data, e.g.
  
# return a unicode or throw an exception
text = str(header)
# should always be okay even if gibberish
raw = bytes(header)
  
  or /something/ like that.

Does that work?  I would think (especially in parallel to text) you
want bytes(header) to be the wire format.  If so, you want it to raise
if it knows it contains gibberish.

And again, we have the problem of whether it should return with the
field name prepended or just the field body.

I have a feeling we should not try to decide what APIs we're going to
spell as __str__ and __bytes__ yet.
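
For comparison, the existing `email.header` API already separates the two views, although not under the `__str__`/`__bytes__` spelling discussed above. This sketch shows only the field body (no field name), and the sample text is invented.

```python
from email.header import Header, decode_header, make_header

text_in = "Grüße aus Köln"

# Wire view: RFC 2047 encoded-words, ASCII only.
wire = Header(text_in, charset="utf-8").encode()

# Text view: decode the encoded-words back to a str.
text_out = str(make_header(decode_header(wire)))

print(wire)
print(text_out)
```

The round trip from text to wire form and back is the invertibility property that keeps coming up in this thread.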

Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Barry Warsaw writes:

  Yeah, idempotency probably is not the right term, though I think  
  historically that's what's been used.  Math geeks, what's the right  
  term here? :)

Invertibility *is* the math term.  Roundtrip is more likely to make
sense to real people.

  I completely agree with you (of course :).

Other way around, I'm sure. <wink>

What-about-the-curmudgeon-behind-the-curtain-ly y'rs,



Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Bill Janssen writes:

  I should point out that I also store lots of metadata in the registered
  MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't
  necessarily conform to the specific set of headers mentioned in RFC822.
  It would be nice if the header support in the email package would also
  support reading and writing that format.

I'm not sure what you're saying here.  RFC 822 is inclusive.  More or
less, if it looks like a header, it is a header, and we need to parse
it at least into field name and field body, whether RFC 822 defines
more specific syntax for it or not.

Is that all, or do you mean you want it to give that MIME format
special treatment, such as a method for converting a Message object
containing a parsed RFC 822 message to a Message object containing a
multipart/report message and a text/rfc822-headers subobject, ready to
have the text/plain and message/delivery-status parts filled in per
RFC 1892?

  And MIME multipart is sometimes used in applications other than email.
  It would be nice if the MIME parsing part of the email module could be
  used for those purposes, as well -- basically without some of the
  headers defined in 2822 and 2821.

Ditto, here.

I would expect that you could feed an HTTP stream containing headers
and content to the Message constructor and get something sensible
back.  Dunno what Barry thinks of that, though.
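
With the current standard library this already works for simple cases. A sketch, using an invented HTTP-style header block and a field name that no RFC defines:

```python
from email import message_from_string

# Header block plus content, shaped like an HTTP response without the
# status line. The X- field is made up.
HTTP_LIKE = (
    "Content-Type: text/html; charset=utf-8\r\n"
    "X-Not-In-Any-RFC: some value\r\n"
    "\r\n"
    "<p>hello</p>"
)

msg = message_from_string(HTTP_LIKE)
print(msg.get_content_type())        # structured access where it is known
print(msg["x-not-in-any-rfc"])       # name/body access for anything else
print(msg.get_payload())
```

Anything that looks like a header is split into field name and field body, whether or not a more specific syntax is defined for it.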


Re: [Email-SIG] fixing the current email module

2009-10-08 Thread Stephen J. Turnbull

Glenn Linderman writes:

   Conversions will eventually be done.  Best it were done quickly.
  
  Disagree.  Deferring the conversions defers failure issues to the point 
  where the code (hopefully) somewhat understands the type of data being 
  manipulated, and can then handle it appropriately.  Converting up front 
  causes errors in things that may never be touched or needed, so the 
  error detection and handling is wasteful.

That's theory; my position is based on Mailman practice.  Don't believe
me, ask Barry.  I also spend most of my OSS time on the
internationalization of XEmacs, and the experience is similar there.
Best to convert everything as early as possible, or admit that you
don't know how.

  So for headers, which are supposed to be ASCII, or encoded via RFC rules 
  to ASCII (no 8-bit chars), then the discovery of an 8-bit char should
  produce a defect report, but then simply be converted to Unicode as if it
  were Latin-1 (since there is no other knowledge available that could 
  produce a better conversion).

No, that is already corruption.  Most clients will assume that string
is valid as a header, because it's valid as a string.
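
A small illustration of why the Latin-1 fallback is already corruption. The header text is invented, and the sketch assumes the sender actually put raw UTF-8 on the wire.

```python
intended = "Subject: 10 € off"

# What a sloppy sender put on the wire: raw UTF-8, no RFC 2047 encoding.
raw = intended.encode("utf-8")

# Decoding as Latin-1 never fails, so no error is ever reported...
guessed = raw.decode("latin-1")

# ...but the result is a perfectly valid str that says something else.
print(repr(guessed))
```

The octets can still be recovered with `guessed.encode("latin-1")`, but a client that sees an ordinary string has no reason to suspect it should do so.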

  And if the result of that is not expected by the client (your
  definition), then the client should either notice the defect report
  and reject it based on that, or attempt to parse it, and reject it
  if it encounters unexpected syntax.  After all, this is, for that
  client, raw user input (albeit from a remote source) so fully
  error checking the input is appropriate.

No way.  That environment would suck to program in.  And it's
un-Pythonic: "Errors should never pass silently."

  Python way.  Since the email library is trying to avoid raising 
  exceptions in large blocks of its code, it is non-Pythonic

I disagree with that.  "Unless explicitly silenced."  The strategy
that Barry and I favor is to signal errors lazily.  So we *explicitly*
silence errors (at least of the Exception kind) when parsing.  If we
can't parse, we look for a part terminator, encapsulate the bad stuff
and move on to the rest of the input.  Later, at use time, *if* the
unparsable object is used, *then* the error will be raised, hopefully
with enough metainformation to figure out what to do about it.

I don't see what's un-Pythonic about that.

Re: [Email-SIG] fixing the current email module

2009-10-07 Thread Stephen J. Turnbull

Glenn Linderman writes:

   If you mean that the email module will keep track of what form the
   object is currently represented by, that will eventually result in
   UnicodeError: octet out of range: 161, ascii.
  
  The above sentence does not communicate your meaning to me... or any 
  meaning, actually.  Can you explain?

Yes, that Unicode error is one that took years for Mailman to work
around.  If we are going to be converting different objects at
different times, I'm sure we'll get to see it again in the future.  Oh,
joy.
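
The Python 3 form of that error is easy to reproduce. A sketch with an invented header; octet 161 is 0xA1.

```python
# 0xA1 is 161: an inverted exclamation mark in Latin-1, invalid in ASCII.
octets = b"Subject: \xa1Hola!"

try:
    octets.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

If the conversion is deferred, this exception surfaces wherever the octets are first treated as text, which may be far from the code that read them.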

  If conversions are avoided, then octets are unlikely to be out of 
  range?

Haven't looked in your spam bucket recently, I guess.  Spammers
regularly put 8 bit characters into headers (and into bodies in
messages without a Content-Type header), for one thing.

  And the email module must be aware of the form of the data in 
  order to manipulate it in any format other than wire format, but 
  fortunately, wire format declares the format of the data (not to say 
  there is not buggy wire format data -- but that is an issue best avoided 
  by avoiding as many conversions as possible).

Best I can't speak to; you obviously are willing to accept a much
higher error rate than I am.  Robust handling of buggy wire format
data means that the email module must do something sane with it before
giving it to the application.  Maybe it's reasonable to do that
lazily, and/or cache the result, but access to bogus data (that the
email module can determine is bogus or suspicious) must not be allowed
unless the client says "hit me with your best shot" explicitly.  Most
clients are simply not going to be prepared for the kind of crap I see
in /var/mail/turnbull every day.

  I was pushing back from your declaration that an archiver would
  always want string output

Please don't push back; we won't get anywhere.  Use cases are
*examples*, not complete specifications of all possible inputs and
outputs.  Use cases should be simple and clear cut.  If you want a
different use case, state it.  In fact in the real world, *all* of the
archivers I know of produce text formats on disk, either deleting
multimedia objects or saving them off and linking to them via URLs in
the text.  If you know of a different kind of archiver, add it as a
use case.

Re: [Email-SIG] fixing the current email module

2009-10-06 Thread Stephen J. Turnbull

Glenn Linderman writes:

  Yes, I interpreted, possibly misinterpreted, Barry's comment about 
  storing things as bytes, as that he was figuring to store them in wire 
  format.

What that means is unclear, though.  Does "a header in wire format"
mean before or after MIME encoding?  Probably after, but that's pretty
useless for the purpose of editing the header.  Does it include the
tag (the part before the colon) or not?  Etc.

  I would tend to agree with that, except that if something is 
  received/provided in a particular format, it might want to stay in that 
  format until such time it is needed in a different format... and then 
  the appropriate set of conversions (current format = internal format = 
  needed format) applied as needed, avoiding all conversions when it is 
  already in the needed format.

If you mean that the email module will keep track of what form the
object is currently represented by, that will eventually result in
UnicodeError: octet out of range: 161, ascii.

  two conversions are slower than none, and use 2-4 times the space in 
  string format.

Let's get this correct, *then* optimize, please.

  One has to write the conversion code anyway; it is just a matter of 
  where it is called.  Once converted, meta data could be retained in its 
  natural format.

Meta data for what?  Why would you convert meta data?

   2.  MUA #1: Composition.  Input will be strings and multimedia file
   names, output will be bytes.  Will attributes of message objects
   be manipulated?  Not in a conventional MUA, but an email-based MUA
   might find uses for that.
  
  I'm not sure what an "email-based MUA" is... seems to me even a
  conventional MUA is email-based???

Only if it's written using the Python email module.

   4.  Mailing list processor.  Message input will be bytes.
   Configuration input, including heading and footer texts that may
   be added are likely to be strings.  Header manipulation (adding
   topics, sequence numbers, RFC 2369 headers) most conveniently done
   with strings.  Output will be bytes.
 
  
  But the bulk of the message parts, received in wire format, may not need 
  to be altered to be sent along in the same wire format.

That depends.  For example, multimedia parts may simply be discarded,
in which case it makes sense to not convert them.  However, most
Mailman lists do add a footer, and because of crappy Windows MUAs that
don't implement MIME correctly, it's preferred to add that by
concatenating as text.  That simply cannot be done correctly in wire
format for any character set except ISO 8859/1.
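
A sketch of the footer problem. The body text, the footer text, and the choice of KOI8-R and UTF-8 are assumptions picked to make the mismatch visible; a real list would see many charsets.

```python
# Body as received on the wire, in the sender's charset.
body_text = "Привет\n"
body_wire = body_text.encode("koi8_r")

# Footer as configured by the list owner, stored in another encoding.
footer_text = "Список рассылки\n"
footer_bytes = footer_text.encode("utf-8")

# Wrong: concatenating octets mixes two charsets in one text/plain part.
naive = body_wire + footer_bytes

# Right: decode to text, concatenate as text, encode once for the wire.
combined = body_wire.decode("koi8_r") + footer_text
correct = combined.encode("koi8_r")

print(repr(naive.decode("koi8_r", errors="replace")))
print(repr(correct.decode("koi8_r")))
```

A reader's MUA decodes the whole part with the declared charset, so the naively appended footer shows up as garbage.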

  Heading and footing texts are configured boilerplate, and could be 
  cached in a variety of formats to avoid the need to convert them for 
  each message,

Premature optimization is the root of all error.

  An archiver could archive wire format,

Are you suggesting that the email module should mandate that?  We have
a severe tail-dog inversion problem here.


Re: [Email-SIG] Ensuring 7 bit encoding

2009-08-28 Thread Stephen J. Turnbull

R. David Murray writes:

   >>> import email.message
   >>> m = email.message.Message()
   >>> m.set_payload("""A few lines
  ... of 8-bit text
  ...
  ... One high bit character: ².
  ... """, 'us-ascii')
   >>> print m.as_string()
  MIME-Version: 1.0
  Content-Type: text/plain; charset=us-ascii
  Content-Transfer-Encoding: 8bit
  
  A few lines
  of 8-bit text
  
  One high bit character: ².
  
  Since 8bit isn't technically us-ascii, I wonder if this is a bug.

This is a bug.


Re: [Email-SIG] Generating zipped or gzipped attachment with emailpackage?

2009-05-21 Thread Stephen J. Turnbull

Mark Sapiro writes:

  Ideally, one would be able to specify a parameter on the Content-Type
  header along the lines of
  
  Content-Type: text/csv; charset=utf-8; compression=gzip

No, I think this is really a content transfer encoding, not part of
Content-Type, and I don't see why one would be enough.  Nor would it
necessarily always be compression.

So how about a Content-Transfer-Filter header which resolves to an
(order-sensitive!) list of transformations:

Content-Transfer-Filter: pgp-encrypted; algorithm=idea; order=3
Content-Transfer-Filter: x-xz; order=2; comment="the successor to LZMA";
  alternate-application=x-lzma
Content-Transfer-Filter: base64; order=1

Order is decoding order here.  Otherwise you'd need a parameter to
determine which to use first (in case of corruption or reordering by
some brain-damaged MUA or MTA).

In the presence of a Content-Transfer-Encoding header, the
Content-Transfer-Encoding should be applied first, then any
Content-Transfer-Filters.
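
How a receiver might apply such a list, as a sketch. Content-Transfer-Filter is only a proposal in this message, not a registered header, and the `FILTERS` list, `DECODERS` table and `decode_filters` helper are hypothetical. The pgp-encrypted step is omitted because the standard library has no decoder for it.

```python
import base64
import lzma

# (order, name) pairs as they might be parsed from the proposed headers.
# They are listed out of order on purpose: "order" decides, not position.
FILTERS = [(2, "x-xz"), (1, "base64")]

# Hypothetical registry mapping filter names to decoding functions.
DECODERS = {
    "base64": base64.b64decode,
    "x-xz": lzma.decompress,
}


def decode_filters(payload, filters):
    """Apply each decoder in ascending 'order', i.e. decoding order."""
    for _order, name in sorted(filters):
        payload = DECODERS[name](payload)
    return payload


original = b"a,b,c\n1,2,3\n"
# The sender applies the inverse steps in reverse: compress, then base64.
on_wire = base64.b64encode(lzma.compress(original))

print(decode_filters(on_wire, FILTERS))
```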


Re: [Email-SIG] API for Header objects

2009-04-17 Thread Stephen J. Turnbull

Tony Nelson writes:

  This example seems tortured and contrived.

Not at all.  I currently use grep, not the email package, but in fact
I extract several headers for use in mailing list moderation.  It's
getting to the point where my gradually accreting shell script doesn't
cut it (more because I'm recruiting additional moderators than because
I'm not happy with it), and if I'm going to do this in Python I
definitely want an obvious and elegant way to produce a displayable
string (ie, Unicode) because not all of the messages I get in Chinese
and Korean are spam.

  Custom code to extract a single header one time to send to someone?

That is precisely why we want a simple readable short elegant API.

Like str(msg['To']).

This also suggests the sequence interface of msg['To'] should not
contain tuples of strings, but rather NameAddr objects (taken from the
RFC 5322 grammar).  Then to flatten a NameAddr, use str or bytes as
appropriate.  So to present a list of addressees in a moderation
interface, you could use

recips = list(msg['To']) + list(msg['Cc'])

# We have a utf-8 codec on stdout, between us and the wire.
print("<ul>\n")
for recip in recips:
    print("  <li>")
    print(htmlesc(str(recip)))
    print("</li>\n")
print("</ul>\n")

Of course for wire protocol, you just use bytes instead of str.
Hey! that's not bad, even if I do say so myself.
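
The API above is a proposal. Something close can be written against the standard library as it exists, as the following sketch shows. It uses `email.utils.getaddresses`, which returns (name, address) tuples rather than NameAddr objects, and `html.escape` in place of the hypothetical `htmlesc`. The message and addresses are invented.

```python
import html
from email import message_from_string
from email.utils import formataddr, getaddresses

msg = message_from_string(
    'To: "Doe, Jane" <jane@example.com>, bob@example.com\n'
    "Cc: Ann <ann@example.com>\n"
    "\n"
    "body\n"
)

# Collect every To and Cc field, then split them into (name, addr) pairs.
recips = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

lines = ["<ul>"]
for name, addr in recips:
    lines.append("  <li>" + html.escape(formataddr((name, addr))) + "</li>")
lines.append("</ul>")
listing = "\n".join(lines)
print(listing)
```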

  Just hit reply and trim it yourself.

That won't work, for several reasons.

  If you must, you can use .get_header('X-Spam-Evidence').flatten().
  I doubt that anyone would actually do that, outside of a debugging
  session.

<sigh />  I do it.

  No.  This is important, and you will not understand RFC x822 email
  until you understand this: email messages are not character
  strings.  They are byte sequences.  This confusion pervades the
  email package only because in Python before 3.x, bytes were
  represented as strings.

That's a bit generous and ungenerous at the same time.  The people who
worked on email were trying to come up with a reasonable interface
that on the one side treated wire format as bytes (Python 1.x, 2.x
str) and display format as text (Python 1.x str, oops, Python 2.x
unicode).  They failed, unfortunately, but not really because the
tools were unavailable.  They just treated the difficulties with
insufficient respect.  On the other hand, these difficulties are
inherent in the medium.  People (by which I mean nobody participating
in this thread) think of email as text.  MTAs think of email as octet
sequences.  Developers (especially Americans) have been sloppy about
that distinction for *five* decades, and because until 2000 at least
email was the sine qua non of networking, backward compatibility has
long demanded incorporating all those mistakes in current practice.

And now you're doing the same thing.  Email messages have at *least*
four ways of manifesting in our world that email-sig needs to worry
about: as byte sequences on the wire, as (mostly, anyway, and
certainly the headers) texts in our MUAs, as whatever-they-really-are,
and as the internal representation of the email package.  So depending
on which side of the argument you feel like taking, you insist
(inconsistently) that "an email is a byte string" or "a header is not
a string at all, it's a structured thingie".  But it's not that easy.

What we need to do is come up with an API that respects all of those
aspects *simultaneously*, and allows us to elegantly but accurately
change the perspective we use to view this whatever-it-really-is.

  No, email is not text.  Email message bodies and some header fields
  may represent text.  An email message is a byte sequence.  One
  really needs to understand this in order to work with email at a
  low level.

Hm.  And here I was hoping that the email package would *implement*
the low level, leaving me free to think about high-level things.

  When one does not understand, then the email package should lead
  the user in the right direction.

No, thank you.  Python is a double-opt-in language.  We're all
consenting adults here.  Programmers who don't understand the RFCs are
likely to be surprised in many places, but they asked for it, they got
it.


Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]

2009-04-17 Thread Stephen J. Turnbull

R. David Murray writes:

  Note that while I want to be able to do str(someHeader) to get a
  string representation of a header body, I'm not so enamored of being
  able to do
  
   message['From'] = 'John Smith <j...@foo.com>'
  
  and have it get turned into a Header or AddressHeader object.
  Frankly, that looks too magical to me.

+1

Well, that would make it easy to write scripts that parse lists of
addresses and do things with them.  Eg, a mailing list manager's mass
subscribe interface.  That would be nice ... but on reflection it's
clear that we would want that to be parsed *strictly*.  So it raises
exceptions, which must be caught and handled, etc etc.  In other
words, it's actually not so easy to write scripts, no matter what you
do, and you also want to be able to specify what kind of magical
fixups (the ever-popular "display-name with unquoted period"
immediately comes to mind as one example) are acceptable, and which
are not, not to mention encoding for non-ASCII text.

How about unstructured header bodies, like Subject?  Should we allow
it, for convenience, or not, for consistency?

How about unknown fields, eg X-Are-We-Not-Structured-No-We-Are-Devo?

I think, in the first draft, we should be *consistent* in both cases.


Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]

2009-04-17 Thread Stephen J. Turnbull

Tony Nelson writes:

  No.  The useful data for an address field is *properly* a list of
  pairs of friendly name, address -- you should read RFC 5322 section
  3.4.

The fact that you think I didn't suggests there's really no point in
continuing to talk to you.  But I'll give it another try.

The issues we are dealing with at this point really have very little
to do with accurate implementation of the RFCs.  We all know that's
necessary, but ... it's a Simple Matter Of Programming.  At least,
that's why Postel, Crocker, et al put so much effort into writing the
RFCs, so it would be a SMOP.  I think they did a pretty good job.

I agree with you that we should make it relatively difficult to put
things that *don't* conform to the RFCs on the wire.  But that should
be the responsibility of the middleware that talks to the file system
and to the MTA.  I see no reason *at this stage* to burden MUA (in the
general sense) developers with all the RFC rules, and MDA/MTA writers
should only need to worry about it for error handling (__bytes__()
should normally do the job for them).  (For values of "should"
equivalent to "in my dreams", I do fear.)

  This makes it very important that the easy way of doing things be
  the correct way.  With Address fields, that way is

Nonsense.  You are ignoring the fact that *people* (ie, nobody
participating in this thread <wink>) read an address field *as text*,
and they type in addresses *as text*.  We do not extract and inject
this information as pickles of Header objects via Firewire sockets
implanted in their skulls.  There is *no /unique/ correct way* here.

  For example (this is a trick question), in your opinion, what
  should
  
  msg['To'][0]
  
  return if the original header was
  
  To: Stephen J. Turnbull <step...@xemacs.org>
  
  ?
  
  ('Stephen J. Turnbull', 'step...@xemacs.org')
  
  You must be very confused to think this is a trick question.
  Try it with the current email package's email.utils.parseaddr().
  Again, see RFC5322 section 3.4.

But section 3.4 is not relevant to the trickiness, and parseaddr is
not strictly conforming.  See the definitions of name-addr,
display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5,
and 3.4 of the RFC you cite.  Also see the definition of special.
Finally, I commend to your attention the definition of obs-phrase in
section 4.1, and the *very* special nature of this particular gotcha
as described there.

The point is that by parsing that and claiming it's an RFC 5322
section 3.4 name-addr, you have invoked the rather magical Postel
Principle.  You either have to say "for my purpose I want magic in the
API" (which you previously denied), or you have to admit that this is
harder than it looks.
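
The leniency is easy to observe with the existing `email.utils` helpers. In this sketch the address is a placeholder at example.org, not the elided one quoted above.

```python
from email.utils import formataddr, parseaddr

# The display name contains an unquoted "." which is a "special" in the
# RFC 5322 grammar, so this is not a strict name-addr.
name, addr = parseaddr("Stephen J. Turnbull <stephen@example.org>")
print((name, addr))

# Going the other way, formataddr adds the quoting the strict grammar needs.
regenerated = formataddr((name, addr))
print(regenerated)
```

The parse succeeds, and the regenerated header differs from the input. That is the Postel Principle at work, and it also means this particular input does not round-trip.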

It is true that section 4.1 says that the obsolete (interpreting)
syntax must be accepted *off the wire*.  So there certainly is a
justification for having a short obvious elegant spelling for "make an
address Header into a sequence".  But IMHO that spelling should be
list(msg['To']), not msg['To'].

The rationale is that---assuming it can be implemented---several of us
would like to be able to spell wire format as bytes(msg['To']) and
display format as str(msg['To']).  I bet there are other uses that
would be well-served by such indirection.  And I would be disappointed
if we can't do way better than msg.get_header('To').flatten() to get
bytes---or should that be string?---out.
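
A rough sketch of how those three spellings could coexist on one
object.  AddressHeader is a hypothetical class, not something the
email package provides, and the wire format here is plain ASCII
(real code would need RFC 2047 encoding for anything else):

```python
from email.utils import getaddresses


class AddressHeader:
    """Hypothetical address header object (illustration only)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        # Parse once into (display name, mailbox) pairs.
        self._pairs = getaddresses([value])

    def __iter__(self):
        # list(header) gives the sequence of pairs.
        return iter(self._pairs)

    def __str__(self):
        # Display format.
        return f"{self.name}: {self.value}"

    def __bytes__(self):
        # Wire format; ASCII only in this sketch.
        return str(self).encode("ascii")


to = AddressHeader("To", "Ann Example <ann@example.org>, bob@example.org")
print(list(to))
print(str(to))
print(bytes(to))
```

The header object itself stays opaque; each conversion is an explicit
request for one particular view of it.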

Internally, the Header whose .useful attribute is returned by
msg['foo'] will contain parsed data, referring to parsed tokens.
Flattening those parsed tokens will produce the original data.  Not
a problem at all, simple to implement, in the most direct way.
  
  And horrid to use, if you mean that the internal representation will
  be a full parse tree according to the augmented BNF in RFCs 822, 2822,
  5322, 2045-2049, etc etc., and that the only other way to access that
  data is via an arbitrarily defined .useful attribute (which, BTW, is
  quite unpythonic if you intend for it to be available as msg['foo'] as
  well: TOOWTDI).
  
  You put words in my mouth.

Of course I don't put words in your mouth.  The phrase "if you mean
that" clearly indicates that what follows is *my* understanding of the
implications of what you wrote.  I think that interpretation is quite
justifiable based on your insistence that the OOWTDI be your sequence
of (address, display-name) pairs.

  Why assume that I am incompetent, or a fool?

I don't assume any such thing.  But I become less and less trustful of
your goodwill toward requirements other than your own.

  Of course the internal representation would include the full parse tree.
  Of course the external interface would provide read and write access to the
  relevant data.

Note that I didn't say it wouldn't.  I said it *would*.  But I think
it's justified, by what you have written so far, to expect that it
would be an inconvenient interface (maybe even horridly so).

  The .useful attribute (need

Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]

2009-04-17 Thread Stephen J. Turnbull

R. David Murray writes:

  put Header objects into it.  I don't think the overhead of
  having to do
  
   message['Subject'] = Header('subject string')

Hm.  Should a Header know which header it is?  Ie, should that be

message['Subject'] = Header('subject', 'subject string')

?  (I assume you would be less than in love with having the assignment
magically stuffing Subject into the Header as it gets assigned.)
___
Email-SIG mailing list
Email-SIG@python.org
Your options: 
http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com

Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]

2009-04-14 Thread Stephen J. Turnbull

Tony Nelson writes:

  Assuming that by "Destination" you mean a class of Address header fields,
  as there is no Destination: header field, such header fields contain
  addresses, which can be considered to contain (as the email package does) a
  list of (name, email address) pairs, or, at a lower level, to also have
  Comments, there is indeed only one correct choice, which is the one the
  email package currently provides the diligent user.  I wish it to be the
  one obvious choice, so that less study is needed to properly use the email
  package.

As you point out above, display names and comments are different.
It's *not* obvious to me that they should be confounded by default.

In any case, it would certainly be possible to implement both the
indexing feature, so that msg['To'][0] returns a (display, mailbox)
tuple, and a converter so that list(msg['to']) returns a list of such
tuples (in both cases, assuming that most users prefer not to
distinguish comments from display names).


Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Stephen J. Turnbull

Shouldn't this thread move lock stock and .signature to email-sig?

Barry Warsaw writes:

   It does seem to make sense to think about headers as text header
   names and text header values.
  
   I disagree.  IMHO, structured header types should have object values,
   and something like
  
  While I agree, there's still a need for a higher level API that make  
  it easy to do the simple things.

Sure.  I'm suggesting that the way to determine whether something is
simple or not is by whether it falls out naturally from correct
structure.  Ie, no operations that only a Cirque du Soleil juggler can
perform are allowed.

  I agree that the Message class needs to be strict.  A parser needs to  
  be lenient;

Not always.  The Postel Principle only applies to stuph coming in off
the wire.  But we're *also* going to be parsing pseudo-email
components that are being handed to us by applications (eg, the
perennial control-character-in-the-unremovable-address Mailman bug).
Our parser should Just Say No to that crap.
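
As a sketch of that split, assuming a hypothetical accept_address()
function and a plain list of defect strings (neither is an email
package API): wire input with a control character is kept and
flagged, application input with one is refused.

```python
import re

# ASCII control characters, including DEL.
CONTROL = re.compile(r"[\x00-\x1f\x7f]")


def accept_address(address, *, from_wire):
    """Return (address, defects); hypothetical helper.

    Lenient for data off the wire, strict for data handed in by an
    application.
    """
    defects = []
    if CONTROL.search(address):
        if not from_wire:
            raise ValueError(f"control character in address {address!r}")
        defects.append("control character in address")
    return address, defects


# Off the wire: accepted, with the problem recorded.
print(accept_address("a\x07b@example.org", from_wire=True))

# From an application: rejected outright.
try:
    accept_address("a\x07b@example.org", from_wire=False)
except ValueError as exc:
    print("rejected:", exc)
```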

  see the .defects attribute introduced in the current email  
  package.  Oh, and this reminds me that we still haven't talked about  
  idempotency.  That's an important principle in the current email  
  package, but do we need to give up on that?

Idempotency?  I'm not sure what that means in the context of the
email package ... multiplication by zero? <wink>  Do you mean that
.parse().to_wire() should be idempotent?  Yes, I think that's a good
idea, and it shouldn't be too hard to implement by (optionally?)
caching the whole original message or individual components (headers
with all whitespace including folding cached verbatim, etc).  I think
caching has to be done, since stuff like "did the original fold with a
leading tab or a leading space, and at what column" and so on seems
kind of pointless to encode as attributes on Header objects.
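
A minimal sketch of the caching idea, assuming a hypothetical
CachedHeader class with a to_wire() method (the names are made up for
illustration, and only ASCII headers are handled).  The original
bytes are replayed verbatim until the value is changed:

```python
class CachedHeader:
    """Hypothetical header that caches its original wire bytes."""

    def __init__(self, raw):
        # Verbatim wire bytes, folding whitespace included.
        self._raw = raw
        name, _, body = raw.partition(b":")
        self.name = name.decode("ascii")
        # Unfold by dropping the CRLFs; the folding whitespace stays.
        self._value = body.replace(b"\r\n", b"").strip().decode("ascii")

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        # The cached bytes no longer match, so discard them.
        self._raw = None

    def to_wire(self):
        if self._raw is not None:
            return self._raw
        return f"{self.name}: {self._value}\r\n".encode("ascii")


raw = b"Subject: a long\r\n\tfolded subject\r\n"
header = CachedHeader(raw)
print(header.to_wire() == raw)  # unmodified header round-trips

header.value = "new subject"
print(header.to_wire())
```

Nothing about tabs, spaces or fold columns is stored as an attribute;
the cache carries all of it for free.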

[Description of MessageTextView and MessageWireView elided.]

  This seems similar to Glyph's basic idea, but with a different spelling.

Yes.  I don't much care which way it's done, and Glyph's style of
spelling is more explicit.  But I was thinking in terms of the number
of people who are surely going to sing "Mama don' 'low no Unicodes
roun' here" and squeal "codec WTF?! outta mah face, man!"

Re: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)

2009-04-10 Thread Stephen J. Turnbull

Bill Janssen writes:
  Barry Warsaw <ba...@python.org> wrote:
  
   In that case, we really need the
   bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build
   things on top of that.
  
  Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the
*current* email package?

Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Stephen J. Turnbull

Barry Warsaw writes:

  There are really two ways to look at an email message.  It's either an  
  unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

  Those objects have headers and payload.  The payload can be of any
  type, though I think it generally breaks down into strings for
  text/* types and bytes for anything else (not counting multiparts).

*sigh*  Why are you back-tracking?

The payload should be of an appropriate *object* type.  Atomic object
types will have their content stored as string or bytes [nb I use
Python 3 terminology throughout].  Composite types (multipart/*) won't
need string or bytes attributes AFAICS.

Start by implementing the application/octet-stream and
text/plain;charset=utf-8 object types, of course.

  It does seem to make sense to think about headers as text header names  
  and text header values.

I disagree.  IMHO, structured header types should have object values,
and something like

message['to'] = "Barry 'da FLUFL' Warsaw <ba...@python.org>"

should be smart enough to detect that it's a string and attempt to
(flexibly) parse it into a fullname and a mailbox adding escapes, etc.
Whether these should be structured objects or they can be strings or
bytes, I'm not sure (probably bytes, not strings, though -- see next
example).  OTOH

message['to'] = b'''Barry 'da.FLUFL' Warsaw <ba...@python.org>'''

should assume that the client knows what they are doing, and should
parse it strictly (and I mean be a real bastard, eg, raise an
exception on any non-ASCII octet), merely dividing it into fullname
and mailbox, and caching the bytes for later insertion in a
wire-format message.
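
Roughly, and only as a sketch: parse_address_value() below is a
hypothetical function, barry@example.org is a placeholder address,
and email.utils.parseaddr() stands in for both the flexible and the
strict parser, which a real implementation would keep separate.

```python
from email.utils import parseaddr


def parse_address_value(value):
    """Return (fullname, mailbox, cached_bytes); hypothetical helper.

    str input is parsed flexibly.  bytes input must be pure ASCII and
    is cached for later insertion in a wire-format message.
    """
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError on any non-ASCII octet.
        text = value.decode("ascii")
        fullname, mailbox = parseaddr(text)
        if not mailbox:
            raise ValueError(f"no mailbox in {value!r}")
        return fullname, mailbox, value
    fullname, mailbox = parseaddr(value)
    return fullname, mailbox, None


# str: the caller gets a best-effort parse and no cached bytes.
print(parse_address_value("Barry Warsaw <barry@example.org>"))

# bytes: the exact octets are kept for the wire.
print(parse_address_value(b'"Barry Warsaw" <barry@example.org>'))

# bytes with a non-ASCII octet: refused.
try:
    parse_address_value("B\u00e4rry <barry@example.org>".encode("utf-8"))
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```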

  In that case, I think you want the values as unicodes, and probably  
  the headers as unicodes containing only ASCII.  So your table would be  
  strings in both cases.  OTOH, maybe your application cares about the  
  raw underlying encoded data, in which case the header names are  
  probably still strings of ASCII-ish unicodes and the values are  
  bytes.  It's this distinction (and I think the competing use cases)  
  that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with
message['to'] always returning a structured_header object (or maybe
even more specifically an address_header object), and methods like

message['to'].build_header_as_text()

which returns

"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

and

message['to'].build_header_in_wire_format()

which returns

b"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

Then have email.textview.Message and email.wireview.Message which
provide a simple interface where message['to'] would invoke
.build_header_as_text() and .build_header_in_wire_format()
respectively.
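
A sketch of that layering, under loud assumptions: none of these
classes exist in the email package, TextViewMessage and
WireViewMessage stand in for the proposed email.textview.Message and
email.wireview.Message, and the wire format is ASCII only.

```python
class AddressHeader:
    """Hypothetical structured header with two explicit renderings."""

    def __init__(self, name, fullname, mailbox):
        self.name = name
        self.fullname = fullname
        self.mailbox = mailbox

    def build_header_as_text(self):
        return f"{self.name}: {self.fullname} <{self.mailbox}>"

    def build_header_in_wire_format(self):
        # ASCII only in this sketch.
        return self.build_header_as_text().encode("ascii")


class StructuredMessage:
    """Holds header objects, keyed case-insensitively."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, header):
        self._headers[name.lower()] = header

    def __getitem__(self, name):
        return self._headers[name.lower()]


class TextViewMessage:
    """View whose indexing returns headers as text."""

    def __init__(self, message):
        self._message = message

    def __getitem__(self, name):
        return self._message[name].build_header_as_text()


class WireViewMessage:
    """View whose indexing returns headers as wire-format bytes."""

    def __init__(self, message):
        self._message = message

    def __getitem__(self, name):
        return self._message[name].build_header_in_wire_format()


message = StructuredMessage()
message["To"] = AddressHeader("To", "Ann Example", "ann@example.org")

print(TextViewMessage(message)["to"])
print(WireViewMessage(message)["to"])
```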

  Thinking about this stuff makes me nostalgic for the sloppy happy days  
  of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,

Re: [Email-SIG] email.header.decode_header eats my spaces

2007-03-28 Thread Stephen J. Turnbull

Barry Warsaw writes:

  Steve writes:

   IMHO, the Header class should be abstract, and there should be
   subclasses that handle dates, lists of addresses, lists of
   message-ids, etc.

  I'm not sure inheritance is the right way to organize this.

I picked inheritance because I see the header type as being fixed at
Header instantiation (I can't think of a use-case for changing a
From header to a Subject header, while Message-ID and
Resent-Message-ID would be handled by the same class), but there are
some things (handling folding, parsing the field name and body) that
are common to all headers.

I would be happy with any scheme that has the property that given a
field name, the semantics of its contents are fixed according to the
field if it is registered, or treated as *text with caution (maybe
extra warnings? etc) if the field is not registered.
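
One way such a scheme might look, as a sketch only.  The registry,
the register() decorator, make_header() and the class names are all
hypothetical, chosen to show field names mapping to fixed semantics
with a cautious fallback for unregistered fields:

```python
# Maps lower-cased field names to header classes.
HEADER_REGISTRY = {}


def register(*field_names):
    """Class decorator that registers a header class for some fields."""
    def decorator(cls):
        for field_name in field_names:
            HEADER_REGISTRY[field_name.lower()] = cls
        return cls
    return decorator


class Header:
    """Common behaviour (field name and body) shared by all headers."""

    def __init__(self, name, body):
        self.name = name
        self.body = body


@register("from", "to", "cc")
class AddressListHeader(Header):
    pass


# Two field names, one class, as with Message-ID and Resent-Message-ID.
@register("message-id", "resent-message-id")
class MessageIDHeader(Header):
    pass


class UnregisteredHeader(Header):
    """Fallback: treat the body as text, with caution."""


def make_header(name, body):
    # The class is fixed by the field name at instantiation.
    cls = HEADER_REGISTRY.get(name.lower(), UnregisteredHeader)
    return cls(name, body)


print(type(make_header("Resent-Message-ID", "<x@example.org>")).__name__)
print(type(make_header("X-Mystery", "who knows")).__name__)
```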

  Or, maybe inheritance is right.  In any case, I think you also want  
  to also have a registry of some sort

Indeed I do!



38 matches
