[Email-SIG] Some parsing/generation issues of email in Python 3
Hans-Peter Jansen writes:

> Dear audience,
>
> when coming back to this list, I couldn't believe my eyes because
> of the low volume level, but after rechecking with the archives, I
> have to accept, it is that quiet here, a bit too quiet from my
> POV.

Hmm. It's just that very few people (one or two) are working on the module and in my experience it has been rock-solid compared to either Python 2.7 email or the package distributed with Mailman 2.1. I doubt very many people are using Python 3 email on high-volume mailstreams yet, as the high-performance networking (eg, Twisted) and perhaps some other libraries were late to be ported.

> I was quite astonished to find out, that this procedure isn't
> working that well anymore: the email module appears way more
> sensible in the current state. This is a bit disappointing, as
> reading the docs conveys, that some effort was put into reliability
> and robustness. Given the much improved unicode handling of Python
> 3 itself and the ever improving experience in handling emails, this
> is contrary to my expectations, I have to confess.

It's a complete rewrite from first principles. It's more robust in principle and more maintainable in practice, but faced with 100s of millions of emails (aka "tsunami of sewage"), the robustness can't be guaranteed. I'm willing to bet it will converge to "robust in practice" much faster than the previous design did.

> Minutes after switching to the new code, I stumbled across a traceback in
> msg.get_all('to') from a header like this:
>
> To: unlisted-recipients: ;,
> ""@pop.kundenserver.de (no To-header on input)
>
> Hmm, not nice.

http://bugs.python.org/issue27257 The header arguably fails to conform to RFC 5321, though it's syntactically permissible in RFC 5322. (See my comment on the issue.)

> All these issues were harvested in less than halve an hour. What
> really troubles me is the quietness around here in the light of
> this experience. Doesn't people use Python (3) yet/anymore for
> these kind of tasks?

Probably not.

> Does somebody care?

email 5 for Python 3 is a complete rewrite from first principles. Yes, somebody cared.

> Am I missing something?

Patience and understanding of how opensource software development works, perhaps.

___
Email-SIG mailing list
Email-SIG@python.org
Your options: https://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
Re: [Email-SIG] API for email threading library?
Bill Janssen writes: I think I'll finesse this issue with another (appropriate) layer of indirection. OK by me (can't bring myself to +1 on a thoughtful finesse. :) In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm working on, I just use symbols named by the message IDs themselves; Yes, that works well for a static persistent representation. Lisp message threading? What's that in aid of, if you can say? The VM MUA for Emacs and XEmacs. RFC 5256 mentions it, but I had to go back to 2822 to figure it out. Tee-hee-hee! The wild, wonderful world of RFCs: You are in a twisty maze of ABNF, all alike
Re: [Email-SIG] header folding
Glenn Linderman writes: To me, wrap means to divide and join as necessary a set of lines (sometimes/often a paragraph) to achieve some number of similar length lines, not to exceed a line length limit, with possibly a shorter one at the end. Typically such usage is in contexts where a paragraph is represented as a single physical line, though. Your set is not part of wrap in my dialect. I think that if these terms are defined in the RFCs, that those definitions should be preferred to mine. Fold is defined per RFC 5322. The others don't seem to be. I think fold should be used for the well-defined operation of header folding (RFC 5322) and also for the well-defined operation of inserting a soft linebreak in quoted-printable bodies (RFC 2045). I'm happy with whatever usage others prefer for the other operations.
Re: [Email-SIG] header folding
R. David Murray writes: Hmm. Makes sense to me. So you'd rather the method were called fold and that refold_source remains the name of the policy control. Yes. What's the word for what is done when a text message is made to have a line length of less than 78 by using quoted printable (or base64) encoding? RFC 2045 discusses insertion of soft line breaks; it doesn't mention a term like folding. Folding seems like a good term to me, though. Note that the RFC 2045 definition of quoted-printable says that physical line length MUST be 76 characters or less, including any terminating = but not the CRLF pair that separates lines. Can anyone see a use case for controlling folding of headers separately from folding of message bodies? I haven't thought of one, which is why I'm thinking one policy knob controls both. The RFCs' treatments differ somewhat. RFC 5322 has both a MUST NOT and a SHOULD NOT exceed limit on line length (998 and 78 characters, not including the CRLF, respectively). RFC 2045 quoted-printable has only the MUST NOT limit of 76 (but the difference in limits is not a big deal). It's not clear to me what exactly the policy knob you're talking about is for body text. There is no policy really allowed if quoted-printable is being used. So the policy knob is whether to use quoted-printable to limit physical line length? The only reason I can think of for having separate controls is that many MUAs mishandle quoted-printable in the body text. Patches don't apply, one-time-key URLs in links get broken and fail to be recognized. On the other hand, header-folding rarely has such consequences in my experience.
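The RFC 2045 limit discussed above can be observed with the standard library's `quopri` module. This is only a minimal sketch: the 200-character sample body is made up for illustration, and nothing here reflects how the email package itself decides when to apply quoted-printable.

```python
import quopri

# One logical body line, far longer than the RFC 2045 limit of 76 characters.
original = b"x" * 200 + b"\n"

encoded = quopri.encodestring(original)
decoded = quopri.decodestring(encoded)

# Length of the longest physical line, counting any trailing "=" soft break
# but not the line terminator.
longest = max(len(line) for line in encoded.splitlines())
print(longest)

# Soft line breaks disappear again on decoding.
print(decoded == original)
```

The soft breaks are invisible to a conforming receiver, which is why the choice is really whether to apply quoted-printable at all rather than how to break the lines.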
Re: [Email-SIG] header folding
R. David Murray writes: That's an interesting point. So perhaps I should rename the control 'header_source_refold'. I don't have a strong opinion, but I tend to think it's unnecessary. On the other hand, we could also provide a separate control for whether or not quoted printable bodies in particular were folded, If the body is already known to be quoted-printable, you don't really have a choice. Folding lines longer than 76 characters after quoted-printable encoding is required by RFC 2045. Of course you can do more folding than necessary (eg, fold an 85-character line at 35 and 70 characters), but that doesn't seem very useful to me. It seems to me that the policy question (if it exists) is: "We have an all-ASCII body with 'long lines'. Shall we encode in quoted-printable only for the purpose of folding the long lines?"
Re: [Email-SIG] header folding
Barry Warsaw ba...@python.org writes: That's at least what I think of, and I do think we could have two knobs to control the different functionality: - To 'split' a line means to take a line longer than a specified maximum, and make it fit into the maximum line length, splitting at whitespace or other semantic separators. In the case of headers, folding is hallowed usage (going back to at least RFC 733), and is very precisely defined by RFC 5322. If we are going to do something non-RFC conformant (yeah, right, we might do that, eh?), splitting would be better. If our implementation is intended to be conformant, I think folding is preferable both for familiarity and ease of reference (look it up in RFC 5322). I think the generalization to bodies is reasonable, although I haven't found any RFC usage of folding in that context in a quick look. - To 'fill' a header means to take the logical contents of the header and recombine and resplit it so that each line is as close to the maximum line length as possible. My analogy here is Emacs's M-q (fill-paragraph). What then is [...] wrapping? Maybe no different than the above. In my dialect, what you describe as filling is (at least potentially) far more sophisticated than what I mean by wrapping. Wrapping moves forward through each line and at the maximum length backtracks to the rightmost break point in the line, breaking there, then continuing the process in the tail line. This can, and in my experience often does, result in very uneven lines. However, I don't think we're talking about filling here. Filling IMHO should be implemented by the email module, but it should be called explicitly by the client, not imposed internally on the basis of a global policy. 
Consider the following ugly header (which is somewhat unlikely to actually appear in a real use case, although it could easily result from cut-and-paste into an MUA's to field): To: Amie Cawinski a...@abc.org, Ichabod Tallman i...@cow.org (there is no trailing whitespace on either line). IMO, there are two plausible fillings (assuming a limit of 78 characters) here: To: Amie Cawinski a...@abc.org, Ichabod Tallman i...@cow.org and To: Amie Cawinski a...@abc.org, Ichabod Tallman i...@cow.org of which the second will be uglified by a RFC-5322-conformant processor into: To: Amie Cawinski a...@abc.org,Ichabod Tallman i...@cow.org (note the extra space after the comma). I personally don't consider either of To: Amie Cawinski a...@abc.org, Ichabod Tallman i...@cow.org To: Amie Cawinski a...@abc.org, TABIchabod Tallman i...@cow.org plausible as a presentation, but YMMV. So filling (to me) is about presentation, not protocol conformance. Anyway, I don't see how we can justify making *these* choices for the user on the basis of a policy that really is about conservative compliance to a wire protocol standard. For example, I personally do not fill 81-character subject headers; it's just too ugly. However, I might want my mail program to conservatively fold them, especially for certain correspondents known to be stuck behind weird MTAs or MUAs. You might have a message body that contains code, in which case you might want to fill the headers (using the terminology above), but not fill the body. That's another example of why control for filling has to be flexible (and why IMHO filling should be called explicitly by the client). However, if the receiving MUA is RFC 2045-conformant, the user cannot tell that quoted-printable folding was used. ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
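The "wrapping" procedure described in this message (scan forward, backtrack from the limit to the rightmost break point, break, continue with the tail) might be sketched as follows. The function name `wrap`, the choice of a single space as the only break point, and the handling of overlong tokens are assumptions made for illustration; this is not the email module's folding code.

```python
def wrap(line, width=78):
    """Greedy wrap: break at the rightmost space at or before `width`."""
    out = []
    while len(line) > width:
        # Backtrack from the limit to the rightmost break point.
        cut = line.rfind(" ", 0, width + 1)
        if cut <= 0:
            # Assumption: an overlong token is left intact and the break
            # happens at the next space after it, if there is one.
            cut = line.find(" ", width)
            if cut == -1:
                break
        out.append(line[:cut])
        line = line[cut + 1:]   # drop the space that was used as the break
    out.append(line)
    return out


header = "To: " + ", ".join(f"user{i}@example.org" for i in range(10))
lines = wrap(header, 40)
for physical_line in lines:
    print(len(physical_line), physical_line)
```

Because each break consumes exactly one space, joining the pieces with single spaces gives back the input, but nothing evens out the line lengths; that is the difference from filling.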
[Email-SIG] header folding
R. David Murray writes: the end. Basically, BaseHeader gets a 'wrap' method, and there is a new policy control, 'refold_source' (I'll probably rename it to 'rewrap_source', since I expect to apply it also to message bodies). This bothers me. Folding and wrapping are two different things. Folding is about invertibly reformatting a single logical line to make machines happy during transmission; what wrapping "does" is not 100% clear to me, but it's about making people happy. (I put "does" in quotes because it's not obvious to me that the source of wrapped text necessarily is a single anything, nor that wrapping need be invertible.) I grant that people and many MUAs take a different point of view about header folding, but clearly the RFCs have moved away from placing any importance on presentation aspects toward specifying an invertible transformation exactly. On the other hand, I think that wrapping should place emphasis on presentation.
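The invertibility of folding can be illustrated with the API that eventually shipped in the Python 3 standard library (`email.policy` and `EmailMessage`), which postdates this thread. A minimal sketch, assuming a made-up subject of plain ASCII words:

```python
from email import message_from_string, policy
from email.message import EmailMessage

# A logical header line much longer than 78 characters.
subject = " ".join(f"word{i}" for i in range(40))

msg = EmailMessage(policy=policy.default)   # max_line_length defaults to 78
msg["Subject"] = subject
wire = msg.as_string()

# Folding: every physical line stays within the limit.
folded_ok = all(len(line) <= 78 for line in wire.splitlines())

# Unfolding: parsing the folded text recovers the single logical line.
parsed = message_from_string(wire, policy=policy.default)
round_trip_ok = parsed["Subject"] == subject

print(wire)
print(folded_ok, round_trip_ok)
```

Nothing here attempts to make the result look pleasant; the transformation only has to be reversible by the receiver.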
[Email-SIG] question on syntax of 'group' in address-list
R. David Murray writes: I've gone through the RFCs and done some additional googling, and haven't been able to confirm the answer to this question: what exactly is the syntax when a group is included in an address-list? (See http://tools.ietf.org/html/rfc5322#section-3.4). The question is, if another address follows the group, are they separated from each other by ';' or by ';,'? The ABNF seems to call for the latter, but I can't find any example showing it. I'm sure that I should accept both on input, Why? I mean, YAGNI. but I'd like to generate the correct form. Does anyone have confirmation or contradiction for my interpretation? From RFC 822. The Cc field contains two groups, separated by ,, with each group terminated by ;. A.3.3. About as complex as you're going to get Date : 27 Aug 76 0932 PDT From : Ken Davis kda...@this-host.this-net Subject : Re: The Syntax in the RFC Sender : KSecy@Other-Host Reply-To : Sam.Irving@Reg.Organization To : George Jones gr...@some-reg.an-Org, Al.Neuman@MAD.Publisher cc : Important folk: Tom Softwood ba...@tree.root, Sam Irving@Other-Host;, Standard Distribution: /main/davis/people/standard@Other-Host, Jonesstandard.dist.3@Tops-20-Host; Comment : Sam is away on business. He asked me to handle his mail for him. He'll be able to provide a more accurate explanation when he returns next week. In-Reply-To: some.string@DBM.Group, George's message X-Special-action: This is a sample of user-defined field- names. There could also be a field-name Special-action, but its name might later be preempted Message-ID: 4231.629.XYzi-What@Other-Host ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
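For comparison, the `email.headerregistry` classes in current Python 3 generate the `;,` form when an address follows a group. A minimal sketch; the names and the `example.org` addresses below are placeholders, not the ones from the RFC 822 example:

```python
from email import message_from_string, policy
from email.headerregistry import Address, Group
from email.message import EmailMessage

folk = Group("Important folk", (
    Address("Tom Softwood", "tom", "example.org"),
    Address("Sam Irving", "sam", "example.org"),
))
solo = Address("George Jones", "george", "example.org")

msg = EmailMessage()
msg["Cc"] = (folk, solo)          # a group followed by a plain address
cc_text = str(msg["Cc"])
print(cc_text)

# Parsing the generated text back yields the same three mailboxes.
parsed = message_from_string("Cc: " + cc_text + "\n\n", policy=policy.default)
print([a.addr_spec for a in parsed["Cc"].addresses])
```

The group is closed by `;` and the list separator `,` follows it, matching the reading of the ABNF given in the question.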
Re: [Email-SIG] [Python-Dev] email package status in 3.X
l...@rmi.net writes: FWIW, after rewriting Programming Python for 3.1, 3.x still feels a lot like a beta to me, almost 2 years after its release. Email, of course, is a big wart. But guess what? Python 2's email module doesn't actually work! Sure, the program runs most of the time, but every program that depends on email must acquire inches of armorplate against all the things that can go wrong. You simply can't rely on it to DTRT except in a pre-MIME, pre-HTML, ASCII-only world. Although they're often addressing general problems, these hacks are *not* integrated back into the email module in most cases, but remain app-specific voodoo. If you live in Kansas, sure, you can concentrate on dodging tornados and completely forget about Unicode and MIME and text/bogus content. For the rest of the world, though, the problem is not Python 3. It's STD 11 (which still points at RFC 822, dated 1982!) It's really inappropriate to point at the email module, whose developers are trying *not* to punt on conformance and robustness, when even the IETF can only run in circles, scream and shout! Maybe there are other problems with Python 3 that deserve to be pointed at, but given the general scarcity of resources I think the email module developers are working on the right things. Unlike many other modules, email really needs to be rewritten from the ground (Python 3) up, because of the centrality of bytes/unicode confusion to all email problems. Python 3 completely changes the assumptions there; a Python 2-style email module really can't work properly. Then on top of that, today we know a lot more about handling issues like text/html content and MIME in general than when the Python 2 email module was designed. New problems have arisen over the period of Python 3 development, like domain keys, which email doesn't handle out of the box AFAIK, but email for Python 3 should IMHO. Should Python 3 have been held back until email was fixed? 
Dunno, but I personally am very glad it was not; where I have a choice, I always use Python 3 now, and have yet to run into a problem. I expect that to change if I can find the time to get involved in email and Mailman 3 development, of course. <wink>
Re: [Email-SIG] invertability and idempotence
Andrew McNamara writes: The discussion had referred to idempotency up until that point, and I didn't want to introduce new terminology. But referring to this: generate(parse(msg)) == msg as idempotency is perfectly valid in my opinion (as in, applying an operation multiple times produces the same result). That would be generate(generate(msg)) == generate(msg) or parse(parse(email)) == parse(email). The input and output of these functions are of *different types*, they cannot possibly be idempotent. I'm +1 on changing to use invertible, -0 on continuing to use idempotent (since it's the traditional idiom), and -1 on using idempotent to mean is deterministic, ie, generate(msg) == generate(msg). If msg changes state in an irrelevant way, it would be nice to produce the same output from generate. But that is not idempotency. And we would need to specify precisely what irrelevant means. For example, if a client of the Message class decides to specify the MIME boundary explicitly, then the output of generate has to change IMO. OTOH, many MIME implementations put the time of day or the generating process into the MIME boundary. This is unnecessary (boundaries need to be unique only message-wide, and the email package can adjust the boundary to not conflict with message content, eg, Emacs/Gnus uses something like -=-=-=-=- by default), and I would hope that email avoids such practices when possible. ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
Re: [Email-SIG] invertability and idempotence
Andrew McNamara writes: didn't want to introduce new terminology. But referring to this: generate(parse(msg)) == msg as idempotency is perfectly valid in my opinion (as in, applying an operation multiple times produces the same result). That would be generate(generate(msg)) == generate(msg) or parse(parse(email)) == parse(email). The input and output of these functions are of *different types*, they cannot possibly be idempotent. You're splitting hairs - the operation generate(parse(X)) is idempotent, and that's what I was referring to. Yes and no. The equation above does imply idempotency, but it is a much stronger statement: generate(parse()) is the identity. That stronger statement could be useful in practice, but it could also be expensive to implement. That tension could engender flamewars if the requirement is expressed by the word idempotency but the intent is identity. For example, suppose that for MIME multipart messages, generate() uses $%$%$%$%$%$ as the separator as long as no component contains that string. Then generate(parse(msg)) will be *equivalent* but not *identical* to msg for most messages received from non-Python-email- using MUAs. generate(parse()) is idempotent, though. I don't think the folks who ask for idempotency would be satisfied with that! As I said earlier, if we're going to use the word idempotent to mean invertible, that's established practice, so we footnote the Humpty-Dumpty-ism, and I can live with that. But if we're going to try to be more accurate, let's be fully accurate. ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
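The two properties being distinguished can be written down as separate checks. This is a minimal sketch using the standard library's default (compat32) parser and generator; the helper name `regenerate` and the sample message are made up for illustration.

```python
import email


def regenerate(raw):
    """generate(parse(raw)) with the stdlib defaults."""
    return email.message_from_bytes(raw).as_bytes()


raw = b"From: a@example.org\nTo: b@example.org\nSubject: test\n\nHello.\n"

once = regenerate(raw)
twice = regenerate(once)

# Idempotent: applying the composite operation again changes nothing.
is_idempotent = (twice == once)

# Identity (the stronger property): the output equals the original input.
is_identity = (once == raw)

print(is_idempotent, is_identity)
```

An implementation that rewrote MIME boundaries would still pass the first check while failing the second on messages from other MUAs, which is the gap the message points out.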
Re: [Email-SIG] fixing the current email module
Barry Warsaw writes: I would propose a radical suggestion: we treat backward compatibility the way Python 3 did. Nice to keep, but we can throw it over the side in order to fix the warts. We'll worry about migration strategy later. +1 Aside: I would really like to have a much more @property based API where appropriate. +1 E.g. Message.get_content_type() would be Message.content_type. And in this case we'd probably have message.payload_bytes or some such. Decoding may require additional parameters so it will probably be a method. Maybe, but in general those parameters can be deduced from the metadata. If we can use those defaults often enough, then the default-decoded version can be a property too. We would have to provide alternatives, though. I've seen Shift JIS encoded Japanese labelled ISO-2022-JP, and apparently many Japanese MUAs actually decode that to Japanese! Not suggesting that we should do the same, but probably the generic function that is used to decode should be exposed as a method so that clients who encounter such nonsense can deal with it, and override any of the metadata.
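One way to picture the property-plus-method split suggested here is a thin wrapper over the existing `Message` accessors. The class `MessageView` and its attribute names are hypothetical, chosen only to mirror the names floated in the message; they are not part of the email package.

```python
import base64
import email


class MessageView:
    """Hypothetical property-style wrapper; not a stdlib class."""

    def __init__(self, msg):
        self._msg = msg

    @property
    def content_type(self):
        return self._msg.get_content_type()

    @property
    def payload_bytes(self):
        # Transfer-decoded payload (base64 undone), still bytes.
        return self._msg.get_payload(decode=True)

    def decode(self, charset=None, errors="strict"):
        # Defaults come from the metadata; a caller may override bad labels.
        charset = charset or self._msg.get_content_charset() or "ascii"
        return self.payload_bytes.decode(charset, errors)

    @property
    def text(self):
        # Default-decoded view, for the common case where the metadata is right.
        return self.decode()


# Shift JIS text mislabelled as ISO-2022-JP, as described above.
body = base64.b64encode("日本語".encode("shift_jis")).decode("ascii")
raw = (
    "Content-Type: text/plain; charset=iso-2022-jp\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + body + "\n"
)
view = MessageView(email.message_from_string(raw))
recovered = view.decode(charset="shift_jis")
print(view.content_type, recovered)
```

The property covers the default case and the method remains available for clients that know the metadata is wrong.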
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: conformant is not in the dictionaries I've consulted. Try these (top 3 goggle results for conformant): conformant- WordWeb dictionary definition (computing) conforming to a particular specification or standard In this paper we present a new approach to conformant planning. Nearest ... www.wordwebonline.com/en/CONFORMANT - Cached - Similar - conformant - Definition from the Merriam-Webster Online Dictionary conformant can be found at Merriam-WebsterUnabridged.com. Click here to start your free trial! Click here to search for another word in the Merriam-Webster ... www.merriam-webster.com/dictionary/conformant - Cached - Similar - Conformance The notion of TEI conformance is intended as an aid in describing the format and contents of a particular document or set of documents. ... www.tei-c.org/Guidelines/P4/html/CF.html - Cached - Similar - A quick look at some of the results show that the word conformant is typically used in a section called conformance, which defines what criteria are used to determine if an application is following the standard or not. OTOH, the fact that the top three results are dictionary definitions suggests an awful lot of people are looking up the word in dictionaries Conforming is mostly a verb, not an adjective. Goggling gives Results 1 - 10 of about 3,680,000 for conforming application, but Results 1 - 10 of about 324,000 for conformant application. Looks like conforming is the preferred adjectival form. but conformable and compliant are synonyms. When used to mean submissive. Conformable won't do. English is hard enough for ESL folks when they can find the words in the dictionary. Compliant does seem to be the winner. Results 1 - 10 of about 13,900,000 for compliant application. Conformant or conforming is better IMHO but much less popular. Tie goes to the lusers, as usual. ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
Re: [Email-SIG] fixing the current email module
I'm running out of time to work on this (yeah, I know it's the weekend, but my life is like that lately). I think we're converging, though, so I'd like to try and tie some of those ends together. Glenn Linderman writes: On approximately 10/9/2009 8:10 AM, came the following characters from the keyboard of Stephen J. Turnbull: Actually, I would say you are emitting leniently, in violation of the Postel principle. You can say that, but I don't have to believe it. I'm talking about accepting; the message has arrived, it is here, the client is trying to look at it, and I'm talking about ways the client can look at not-quite-perfect data, knowing that it is not quite perfect, but still being able to see it. I'm not at all talking about emitting data. It would be indeed, if the corrupt data is stored in the place where correctly decoded data normally is stored, and is accessible in the same way. But I gather that's not what you were talking about, my mistake. You seem to be calling the email package helping the client to accept not-quite-perfect data, as a form of emitting data. It is not. No, I was confused by the way you wrote. Saving the data *somewhere* is absolutely necessary; not losing data is the #1 commandment of low-level mail processing. Surely the email module is subject to that commandment. *Nobody* is talking about losing any data yet, except Barry indirectly when he says that some people think about giving up on invertibility (often called idempotency), and even he is quite adamant that he's not going to give up on that. So when you wrote about saving and converting to text form, without mentioning the specific APIs, I assumed you meant the mainline APIs for parsing and accessing parts of a correctly formatted message. The email package cannot police the client... if it chooses to eat it in a single gulp without looking at it then it may get indigestion. 
I never suggested that converting to Unicode as if it were Latin-1 should be done without informing the client, or being requested by the client to do that via a special API call... Well, maybe I misread it, but it certainly looked like that to me. I would not object to that special API call defaulting to ISO 8859/1. If you ignore defect reports, you are ignorant (blunt, but not intended to be offensive). What I worried about is that if defect reports are present, *but displayable data is also present*, programmers *will* simply display it, for example in producing a prototype program. It will be impossible to determine without very close analysis of that program that an early version became a production version without adding appropriate checks. In practice, this bug will be discovered when some end user's installation breaks. It seems that you agree with this, and because the special API call is necessary, it will be easy to identify whether proper care is being taken or not. Right? It is still raw user input, and should still be checked for proper syntax by the client, Nonsense. The email module had better know a lot more about syntax than the client. If it doesn't, whack it with a 2x4 until it learns! I think we are talking at cross purposes here. I find it quite difficult to follow where you cross the boundary between talking about one sort of email package client, and then switch to another type, or switch to the responsibilities of the email package. Excuse me? The raw user input you referred to above is material that the client software receives from the email package. The email package should give it to the client in the normal (convenient) way only if it can certify that it conforms to the appropriate standard. That standard should be specified in the API documentation. Any more detailed structure, of course, is the responsibility of the client. An application which is using email as a transport, has specific goals, which require specific content. 
You were mentioning clients. I've already said that when I speak of an MUA, I write MUA. In speaking of the calling program, which might even be a user running the module via the Python interpreter, I write client. It's a very convenient way to describe the user of an API, in contrast to the provider of the API (the implementation). If such a client doesn't validate the syntax of that content, it isn't much of an application. If that MUA or email application uses RFC 822 addresses, it should be able to rely on the email module to parse those addresses correctly, or provide a defect report. One might even go so far as to suggest that it be able to parse the (non-RFC, but very common) + notation for separating the mailbox from additional data used for VERP and challenge-response applications. That would have to be documented, but if so documented client applications like the MUA should be able to rely on it (and you can bet many will). Application domain syntax
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: On approximately 10/9/2009 3:08 PM, came the following characters from the keyboard of Tokio Kikuchi: Your suggestions 1)-4) are not acceptable to Japanese users at all. If a message with an encoded header arrives (like your number 2 sample) but it cannot be decoded, what action _is_ acceptable to Japanese users? And what action is implemented in Mailman (if different)? I know a fair bit about Japanese (both the language and the users), and I'm having difficulty understanding what Tokio means, given your list of hypotheses. I suspect he's basically rejecting the hypothesis that it can't be decoded -- if it can't be decoded, then learn how to do so! I can think of a 5th technique... don't modify the header, and send it through unchanged. Now I think I've covered the gamut of possibilities, I agree. However, I think we're way out of bounds here. We already know how to decode anything that RFC 2047 can throw at us in charsets that Python can handle. Anything that can't be decoded then is seriously malformed from the point of view of the mailing list users. So why are we discussing this? We don't even know what our mainline APIs are going to look like, why are we discussing forcibly operating on broken input? [[ Aside: with an appropriate translation for Re: ). Re is a Latin abbreviation; there is no appropriate translation. ;-) ]] MUAs or mailing list handlers that attempt to retain what was sent (idempotency or invertibility), would be more likely to do what I describe, and are more robust when faced with new character sets that they don't understand how to decode. Maybe they are, but the email module doesn't know or care about what they do. Let's stick within what the email module is supposed to handle.
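The "5th technique" (pass the header through unchanged when it cannot be decoded) can be sketched on top of `email.header.decode_header`. The helper name `decode_or_keep` and the charset label `x-no-such-charset` are invented for this illustration; this is not what Mailman or the email package does.

```python
import base64
from email.header import decode_header


def decode_or_keep(raw_value):
    """Decode RFC 2047 encoded words, or return the header untouched."""
    parts = []
    for chunk, charset in decode_header(raw_value):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                # Unknown charset or bad bytes: leave the header as it arrived.
                return raw_value
        parts.append(chunk)
    return "".join(parts)


good = "=?utf-8?b?" + base64.b64encode("日本語".encode("utf-8")).decode("ascii") + "?="
odd = "=?x-no-such-charset?q?abc?="

print(decode_or_keep(good))
print(decode_or_keep(odd))
```

A charset that Python has no codec for raises `LookupError`, so the second header comes back byte-for-byte as received.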
Re: [Email-SIG] fixing the current email module
R. David Murray writes: I have set up two more documents on the wiki. One is UseCases[1], [...]. The other is a Glossary[2]. Thank you, very much! I think most of it accurately reflects the consensus here, but in it I'm proposing to use the term 'transfer-decoded' for #3, and 'transfer-encoded' as an alternative to 'wire-format' just for symmetry. Comments and suggestions welcome. 'Wire-format' means you can cat it to the wire, ie, RFC-conforming (in fact, it's the only meaning in the RFCs by definition), and for email itself it's always bytes AFAIK (Mama don' 'low no XML roun' here, Lord, Lord!). That's not true of all our applications, though, especially stuff like doctests. There are also some RFCs we use such as BASE64 (specifically relevant to transfer encodings) that are defined in terms of characters, not bytes, so 'transfer-encoded' is slightly different from 'wire-format'. I think in general that kind of comment should be applied directly to the Glossary, but what deserves general discussion is how pedantic do we want to be? I think the distinction made here between 'wire-format' and 'transfer-encoded' is useful *to us*, and in general lean toward high pedantry (cf how much smoke and how little fire Glenn and I are generating!) WDOT? ___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: (I switched conformant to compliant, Conformant is in common use. You might be more comfortable with conforming. Richard Stallman points out that you comply with the law, but you conform to a standard. I think it's useful to make that semantic distinction, cf. RFC 2119 MUST vs. SHOULD or MAY.
Re: [Email-SIG] fixing the current email module
Oleg Broytman writes: On Wed, Oct 07, 2009 at 11:23:24AM -0500, Matthew Dixon Cowles wrote: In my opinion, the email module should never raise an exception as a result of working with a malformed message. Though it should certainly make the information that a message was malformed available for the calling program to check. I disagree. email package is not a user agent, and exceptions are *the* way to indicate there are problems. Although practicality beats purity. The email package has access to the wire format, and knows what to do with most of it. It should DTRT where that is possible, and punt where not. By punt I mean return a special object containing as much of the meta data for an object as it could recover, along with the data itself as a blob. I would suggest that module utilities that require access to the parsed form of data be designed as object methods. The special objects produced when broken wire format is encountered wouldn't have those methods, and thus they'd fail the duck type test. But that makes sense: that duck can't quack anyway. So this gives our (== Matt and me) desideratum that email never raises (it's the Python runtime that will raise AttributeError), and also Oleg's (in part, anyway): an exception *will* be raised. I think (== hope) that this will sufficiently localize the issues that even though only AttributeError would ever be raised, it will be obvious what went wrong. Then the calling program must catch all exceptions That is just unreasonable. There are too many ways for things to go wrong. If you have just one exception for all problems, it's easy to catch them all, but then the client doesn't know what went wrong, and has to partially parse the unparsable itself. That's nuts; the reason for using the email module is to delegate that in the first place, and besides, to the extent it's possible, the module has presumably done that. 
OTOH, a long list of precise exceptions is a maintenance burden both on the email module and on client programmers.

> Yes, if email parses a message in some way - ok. You can help by creating more intelligent parser(s). But if a parser stumbles upon an unparseable block - it must raise an exception.

No, that's the last thing you want it to do. Suppose you have

    Content-Type: multipart/alternative
        Content-Type: text/plain
        Content-Type: text/html; body-parseable=no

Clearly you want (a) a vanilla email client to just grab the text/plain part, and (b) a client written by somebody whose boss uses BustedMUA[tm] to be able to try to parse the text/html part, using the special rules that apply to the jumble produced by BustedMUA.

In other cases, you might be able to find a valid part terminator, but the header of that part was hosed. So the whole part becomes a blob, but the parser should resync at that point, and start parsing following parts.

I can think of no input for which the parser should *ever* throw an exception. Utilities that depend on a particular object's parsed form might have to do so, but even then it should be avoided if at all possible.
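For readers who want to try use case (a) and (b) against today's standard library, here is a minimal sketch. It assumes the Python 3 email package with policy.default; the message text and the "alt" boundary are made up for illustration, and this is not the design that was being discussed in the thread.

```python
from email import message_from_string, policy

# A made-up two-part message; boundary and body text are illustrative only.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain; charset="us-ascii"

plain text version

--alt
Content-Type: text/html; charset="us-ascii"

<p>html version, possibly mangled by a broken MUA
--alt--
"""

msg = message_from_string(RAW, policy=policy.default)

# (a) A vanilla client asks only for the text/plain alternative.
plain_part = msg.get_body(preferencelist=("plain",))
plain_text = plain_part.get_content()

# (b) A client with special knowledge can still fetch the other part's
# payload as an undecoded string and apply its own rules to it.
html_part = msg.get_body(preferencelist=("html",))
html_blob = html_part.get_payload()

print(plain_text.strip())
```

The point of the sketch is only that the two clients never have to touch each other's part.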
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes:

> If conversions are avoided, then octets are unlikely to be out of range?

> Haven't looked in your spam bucket recently, I guess. Spammers regularly put 8 bit characters into headers (and into bodies in messages without a Content-Type header), for one thing.

> I'm aware of that, but if conversions are not done, octets are unlikely to be _reported_ to be out of range

Conversions will eventually be done. Best it were done quickly.

> Most clients are simply not going to be prepared for the kind of crap I see in /var/mail/turnbull every day.

> Are you referring to most email clients, or most Python-email-library-using clients?

Sorry. When I mean MUA I try to say MUA. By client, I'm referring to the higher level logic that is going to be calling the email module.

> Is it your point of view, then, that incorrectly formed email should be mostly treated as SPAM?

Heavens no! Not by the email module, anyway! The email module should not know about spam (but see Barry's "we're having spam for Launchpad" post: if you're that good, anything goes!), except maybe at a very high level.

> Your "hit me with your best shot" comment indicates that you want a failure code or exception when the data is bad, and then a way to retry accepting errors?

My current thinking is that the email module should return an object representing a partial parse. The way that you find out if it is partial is to try to access some data that should be in the object. If the parse succeeded, the accessor returns the data (which might be empty). If the parse did not succeed, you get an AttributeError. (This is just a paraphrase of what I wrote in response to Oleg.)
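The partial-parse idea above can be sketched in a few lines. Everything here is hypothetical: ParsedPart, OpaquePart and parse_part are illustration names, not part of the email package, and decoding by charset stands in for whatever real parsing step might fail.

```python
class ParsedPart:
    """A part whose content was successfully parsed."""

    def __init__(self, headers, text):
        self.headers = headers
        self._text = text

    def get_text(self):
        return self._text


class OpaquePart:
    """Returned (not raised) when parsing fails: metadata plus the raw blob.

    It deliberately has no get_text(), so callers that need parsed content
    get an AttributeError from the Python runtime.
    """

    def __init__(self, headers, raw, reason):
        self.headers = headers
        self.raw = raw
        self.reason = reason  # keeps the parser's diagnosis available


def parse_part(headers, raw):
    """Hypothetical parser step: never raises, always returns an object."""
    try:
        text = raw.decode(headers.get("charset", "ascii"))
    except (UnicodeDecodeError, LookupError) as exc:
        return OpaquePart(headers, raw, reason=str(exc))
    return ParsedPart(headers, text)


good = parse_part({"charset": "utf-8"}, "héllo".encode("utf-8"))
bad = parse_part({"charset": "ascii"}, b"\xa1Hola!")  # octet 161 is not ASCII

try:
    bad.get_text()
    outcome = "parsed"
except AttributeError:
    outcome = "opaque: " + bad.reason
```

Keeping a reason attribute on the opaque object is one way to answer the worry that AttributeError alone loses the parser's information.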
Re: [Email-SIG] fixing the current email module
Oleg Broytman writes:

> where not. By punt I mean return a special object containing as much of the meta data for an object as it could recover, along with the data itself as a blob.

> The special object is an instance of an exception class ;)

It could be, but it will be returned with return, not raise. ;)

> I think (== hope) that this will sufficiently localize the issues that even though only AttributeError would ever be raised, it will be obvious what went wrong.

> Not exactly. One can see an AttributeError, but what was the cause? Why has a parser created a broken object? AttributeError doesn't preserve information from the parser.

Who said it wouldn't? Granted, I didn't say it would, but in my

    Content-Type: multipart/alternative
        Content-Type: text/plain
        Content-Type: text/html; parseable=no

example, I would expect the object returned to reflect that structure. In particular the object representing the second MIME part would indeed possess a valid Header member. I would also attach the original data (which in the case of a missing separator might very well overrun into other parts, etc), but it would *not* be accessible via the usual methods (eg, definitely not from .flatten()). So in fact it's not clear to me that you could ask for more information than that.

> I can think of no input for which the parser should *ever* throw an exception.

> Are you saying that even random garbage would be parsed to a Message of some kind? No headers, a single unparsed body?..

As long as it contains no NULs or high-bit-set octets, and is separated into at least two parts, each less than 998 characters long, by a CRLF, yes, I would definitely expect that an otherwise randomly generated string would be parsed to a Message. This Message should not be sendable because RFC 5322 requires the presence of a From and a Date.
However, if you were implementing a sendmail-compatible MTA or LDA, you might very well wish to accept such a thing on stdin, parse it to a Message, and then default the From and Date header fields appropriately, and add a Message-ID header field. I would, anyway, wouldn't you? Ah, yes, that's another use case, isn't it?!
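The defaulting step described above might look like the following sketch, using only the standard library's email.utils helpers. The function name fill_required_headers, the default address and the example.org domain are assumptions made up for the illustration.

```python
from email import message_from_string
from email.utils import formatdate, make_msgid


def fill_required_headers(msg, default_from):
    """Add the fields RFC 5322 requires, plus a Message-ID, if absent."""
    if "From" not in msg:
        msg["From"] = default_from
    if "Date" not in msg:
        msg["Date"] = formatdate(localtime=True)
    if "Message-ID" not in msg:
        # A fixed domain avoids a hostname lookup in this sketch.
        msg["Message-ID"] = make_msgid(domain="example.org")
    return msg


# Something a local user might pipe in on stdin: no From, no Date.
incoming = message_from_string("Subject: hi\n\nbody text\n")
outgoing = fill_required_headers(incoming, "local-user@example.org")
print(outgoing.as_string())
```

Headers already present on input are left alone, so a caller that supplies its own From keeps it.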
Re: [Email-SIG] fixing the current email module
Barry Warsaw writes:

> >>> from email import message_from_string
> >>> with open('/dev/urandom') as wire:
> ...     data = wire.read(1024)
> ...     # insert A
> ...
> >>> msg = message_from_string(data)
> >>> # number of headers
> ... len(msg)
> 0
> >>> len(msg.get_payload())
> 1024
> >>> msg.defects
> []
>
> This actually makes perfect sense. A message with no headers and a mass of 1024 bytes in its payload is RFC valid!

If you insert at A

    data = ''.join(chr(ord(ch) & 127) for ch in data)
    # optional with reasonably high probability:
    data = data[0:512] + '\r\n' + data[512:1024]

or similar. Otherwise not. ;-)
Re: [Email-SIG] fixing the current email module
Barry Warsaw writes:

On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

Headers could possibly be a quadruple instead of a triple, with the 4th item being the wire format if received?

I think the whole input format (note, not necessarily wire!) should be saved off on the top-level Message object (possibly in a file, per Barry's comments about that). Subobjects could then refer to pieces of that as position ranges.

I think not a quad. I think other APIs should be used to extract the raw data, e.g.

    # return a unicode or throw an exception
    text = str(header)
    # should always be okay even if gibberish
    raw = bytes(header)

or /something/ like that. Does that work?

I would think (especially in parallel to text) you want bytes(header) to be the wire format. If so, you want it to raise if it knows it contains gibberish. And again, we have the problem of whether it should return with the field name prepended or just the field body.

I have a feeling we should not try to decide what APIs we're going to spell as __str__ and __bytes__ yet.
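As a point of comparison only, here is how the text view and the wire view can be obtained with the email package as it exists in Python 3 today (policy.default). This is a sketch of current behaviour, not the API being debated above; in particular, the wire form is requested from the whole message with as_bytes() rather than from an individual header, and the subject text and address are invented.

```python
import base64
from email import message_from_bytes, policy

# Build a small wire-format message with an RFC 2047 encoded Subject.
subject = "こんにちは"
encoded = base64.b64encode(subject.encode("utf-8")).decode("ascii")
wire = (
    "Subject: =?utf-8?b?" + encoded + "?=\n"
    "To: someone@example.org\n"
    "\n"
    "body\n"
).encode("ascii")

msg = message_from_bytes(wire, policy=policy.default)

# Text view: the header value decoded to a Python str.
text = str(msg["Subject"])

# Wire view: bytes for the whole message, encoded word still in place.
raw = msg.as_bytes()

print(text)
```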
Re: [Email-SIG] fixing the current email module
Barry Warsaw writes:

> Yeah, idempotency probably is not the right term, though I think historically that's what's been used. Math geeks, what's the right term here? :)

Invertibility *is* the math term. Roundtrip is more likely to make sense to real people.

> I completely agree with you (of course :).

Other way around, I'm sure. <wink>

What-about-the-curmudgeon-behind-the-curtain-ly y'rs,
Re: [Email-SIG] fixing the current email module
Bill Janssen writes:

> I should point out that I also store lots of metadata in the registered MIME format text/rfc822-headers (defined in RFC 1892), data that doesn't necessarily conform to the specific set of headers mentioned in RFC822. It would be nice if the header support in the email package would also support reading and writing that format.

I'm not sure what you're saying here. RFC 822 is inclusive. More or less, if it looks like a header, it is a header, and we need to parse it at least into field name and field body, whether RFC 822 defines more specific syntax for it or not. Is that all, or do you mean you want it to give that MIME format special treatment, such as a method for converting a Message object containing a parsed RFC 822 message to a Message object containing a multipart/report message and a text/rfc822-headers subobject, ready to have the text/plain and message/delivery-status parts filled in per RFC 1892?

> And MIME multipart is sometimes used in applications other than email. It would be nice if the MIME parsing part of the email module could be used for those purposes, as well -- basically without some of the headers defined in 2822 and 2821.

Ditto, here. I would expect that you could feed an HTTP stream containing headers and content to the Message constructor and get something sensible back. Dunno what Barry thinks of that, though.
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: Conversions will eventually be done. Best it were done quickly. Disagree. Deferring the conversions defers failure issues to the point where the code (hopefully) somewhat understands the type of data being manipulated, and can then handle it appropriately. Converting up front causes errors in things that may never be touched or needed, so the error detection and handling is wasteful. That's theory; my position is based on Mailman practice. Don't believe me, ask Barry. I also spend most of my OSS time on the internationalization of XEmacs, and the experience is similar there. Best to convert everything as early as possible, or admit that you don't know how. So for headers, which are supposed to be ASCII, or encoded via RFC rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char should be produce a defect report, but then simply converted to Unicode as if it were Latin-1 (since there is no other knowledge available that could produce a better conversion). No, that is already corruption. Most clients will assume that string is valid as a header, because it's valid as a string. And if the result of that is not expected by the client (your definition), then the client should either notice the defect report and reject it based on that, or attempt to parse it, and reject it if it encounters unexpected syntax. After all, this is, for that client, raw user input (albeit from a remote source) so fully error checking the input is appropriate. No way. That environment would suck to program in. And it's un-Pythonic: Errors should never pass silently. Python way. Since the email library is trying to avoid raising exceptions in large blocks of its code, it is non-Pythonic I disagree with that. Unless explicitly silenced. The strategy that Barry and I favor is to signal errors lazily. So we *explicitly* silence errors (at least of the Exception kind) when parsing. 
If we can't parse, we look for a part terminator, encapsulate the bad stuff and move on to the rest of the input. Later, at use time, *if* the unparsable object is used, *then* the error will be raised, hopefully with enough metainformation to figure out what to do about it. I don't see what's un-Pythonic about that.
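One related mechanism that exists in the Python 3 email package is the defects list: the parser records problems on the object instead of raising, and a strict policy turns the same problems into exceptions. The sketch below assumes policy.default and policy.strict; the broken message is made up, and this illustrates the current package rather than the lazy-error design proposed above.

```python
from email import errors, message_from_string, policy

# Claims to be multipart but gives no boundary parameter, so the body
# cannot be split into parts.
BROKEN = """\
Subject: broken example
MIME-Version: 1.0
Content-Type: multipart/mixed

--whatever
Content-Type: text/plain

hello
--whatever--
"""

# Lenient parse: no exception. The body is kept as one opaque string and
# the problem is recorded for the caller to inspect later.
msg = message_from_string(BROKEN, policy=policy.default)
defect_names = [type(d).__name__ for d in msg.defects]
payload_is_blob = isinstance(msg.get_payload(), str)

# Strict parse: the same problem is raised instead of recorded.
try:
    message_from_string(BROKEN, policy=policy.strict)
    strict_raised = False
except errors.MessageDefect:
    strict_raised = True

print(defect_names, payload_is_blob, strict_raised)
```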
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: If you mean that the email module will keep track of what form the object is currently represented by, that will eventually result in UnicodeError: octet out of range: 161, ascii. The above sentence does not communicate your meaning to me... or any meaning, actually. Can you explain? Yes, that Unicode error is one that took years for Mailman to work around. If we are going to be converting different objects at different times, I'm sure we'll get to see it agin in the future. Oh, joy. If conversions are avoided, then octets are unlikely to be out of range? Haven't looked in your spam bucket recently, I guess. Spammers regularly put 8 bit characters into headers (and into bodies in messages without a Content-Type header), for one thing. And the email module must be aware of the form of the data in order to manipulate it in any format other than wire format, but fortunately, wire format declares the format of the data (not to say there is not buggy wire format data -- but that is an issue best avoided by avoiding as many conversions as possible). Best I can't speak to; you obviously are willing to accept a much higher error rate than I am. Robust handling of buggy wire format data means that the email module must do something sane with it before giving it to the application. Maybe it's reasonable to do that lazily, and/or cache the result, but access to bogus data (that the email module can determine is bogus or suspicious) must not be allowed unless the client says hit me with your best shot explicitly. Most clients are simply not going to be prepared for the kind of crap I see in /var/mail/turnbull every day. I was pushing back from your declaration that an archiver would always want string output Please don't push back; we won't get anywhere. Use cases are *examples*, not complete specifications of all possible inputs and outputs. Use cases should be simple and clear cut. If you want a different use case, state it. 
In fact in the real world, *all* of the archivers I know of produce text formats on disk, either deleting multimedia objects or saving them off and linking to them via URLs in the text. If you know of a different kind of archiver, add it as a use case.
Re: [Email-SIG] fixing the current email module
Glenn Linderman writes: Yes, I interpreted, possibly misinterpreted, Barry's comment about storing things as bytes, as that he was figuring to store them in wire format. What that means is unclear, though. Does a header in wire format mean before or after MIME encoding? Probably after, but that's pretty useless for the purpose of editing the header. Does it include the tag (the part before the colon) or not? Etc. I would tend to agree with that, except that if something is received/provided in a particular format, it might want to stay in that format until such time it is needed in a different format... and then the appropriate set of conversions (current format = internal format = needed format) applied as needed, avoiding all conversions when it is already in the needed format. If you mean that the email module will keep track of what form the object is currently represented by, that will eventually result in UnicodeError: octet out of range: 161, ascii. two conversions are slower than none, and use 2-4 times the space in string format. Let's get this correct, *then* optimize, please. One has to write the conversion code anyway; it is just a matter of where it is called. Once converted, meta data could be retained in its natural format. Meta data for what? Why would you convert meta data? 2. MUA #1: Composition. Input will be strings and multimedia file names, output will be bytes. Will attributes of message objects be manipulated? Not in a conventional MUA, but an email-based MUA might find uses for that. I'm not sure what an email-based MUA is seems to me even a conventional MUA is email-based??? Only if it's written using the Python email module. 4. Mailing list processor. Message input will be bytes. Configuration input, including heading and footer texts that may be added are likely to be strings. Header manipulation (adding topics, sequence numbers, RFC 2369 headers) most conveniently done with strings. Output will be bytes. 
> But the bulk of the message parts, received in wire format, may not need to be altered to be sent along in the same wire format.

That depends. For example, multimedia parts may simply be discarded, in which case it makes sense to not convert them. However, most Mailman lists do add a footer, and because of crappy Windows MUAs that don't implement MIME correctly, it's preferred to add that by concatenating as text. That simply cannot be done correctly in wire format for any character set except ISO 8859/1.

> Heading and footing texts are configured boilerplate, and could be cached in a variety of formats to avoid the need to convert them for each message,

Premature optimization is the root of all error.

> An archiver could archive wire format,

Are you suggesting that the email module should mandate that? We have a severe tail-dog inversion problem here.
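The "concatenate as text" approach to footers can be sketched with the current Python 3 email package: decode the body to str, append the footer, and let the package re-encode. The FOOTER text and the sample body are invented for the example, and a real list manager has many more cases (multipart messages, unknown charsets) that this sketch ignores.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

FOOTER = "_______\nExample list footer\n"  # hypothetical boilerplate

# Stand-in for a message arriving off the wire: UTF-8 text sent as base64,
# so the footer cannot simply be appended to the encoded bytes.
src = EmailMessage()
src["Subject"] = "test"
src.set_content("こんにちは\n", charset="utf-8", cte="base64")
wire_in = src.as_bytes()

msg = message_from_bytes(wire_in, policy=policy.default)

# Wire bytes -> text, concatenate as text, text -> wire bytes.
text = msg.get_content()
msg.set_content(text + FOOTER, charset="utf-8", cte="base64")
wire_out = msg.as_bytes()

# Check by parsing the output again.
out = message_from_bytes(wire_out, policy=policy.default).get_content()
print(out)
```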
Re: [Email-SIG] Ensuring 7 bit encoding
R. David Murray writes:

> >>> import email.message
> >>> m = email.message.Message()
> >>> m.set_payload("""A few lines
> ... of 8-bit text
> ...
> ... One high bit character: ².
> ... """, 'us-ascii')
> >>> print m.as_string()
> MIME-Version: 1.0
> Content-Type: text/plain; charset="us-ascii"
> Content-Transfer-Encoding: 8bit
>
> A few lines
> of 8-bit text
>
> One high bit character: ².
>
> Since 8bit isn't technically us-ascii, I wonder if this is a bug.

This is a bug.
Re: [Email-SIG] Generating zipped or gzipped attachment with email package?
Mark Sapiro writes:

> Ideally, one would be able to specify a parameter on the Content-Type header along the lines of
>
>     Content-Type: text/csv; charset=utf-8; compression=gzip

No, I think this is really a content transfer encoding, not part of Content-Type, and I don't see why one would be enough. Nor would it necessarily always be compression. So how about a Content-Transfer-Filter header which resolves to an (order-sensitive!) list of transformations:

    Content-Transfer-Filter: pgp-encrypted; algorithm=idea; order=3
    Content-Transfer-Filter: x-xz; order=2; comment="the successor to LZMA"; alternate-application=x-lzma
    Content-Transfer-Filter: base64; order=1

Order is decoding order here. Otherwise you'd need a parameter to determine which to use first (in case of corruption or reordering by some brain-damaged MUA or MTA).

In the presence of a Content-Transfer-Encoding header, the Content-Transfer-Encoding should be applied first, then any Content-Transfer-Filters.
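To make the ordering rule concrete, here is a sketch of a decoder for the proposed header. Content-Transfer-Filter is only a proposal from this message, not a registered header, and DECODERS, parse_filter and apply_filters are hypothetical names; only the base64 and x-xz filters are handled.

```python
import base64
import lzma
from email import message_from_string

# Hypothetical decoders keyed by the filter names used in the proposal.
DECODERS = {
    "base64": base64.b64decode,
    "x-xz": lzma.decompress,
}


def parse_filter(value):
    """Split 'name; key=value; ...' into (name, params)."""
    name, *rest = [piece.strip() for piece in value.split(";")]
    params = dict(piece.split("=", 1) for piece in rest if "=" in piece)
    return name.lower(), params


def apply_filters(msg):
    """Undo the filters in decoding order, whatever order the headers are in."""
    filters = [parse_filter(v) for v in msg.get_all("Content-Transfer-Filter", [])]
    filters.sort(key=lambda item: int(item[1]["order"]))
    data = msg.get_payload().encode("ascii")
    for name, _params in filters:
        data = DECODERS[name](data)
    return data


original = b"a,b\n1,2\n"
encoded = base64.b64encode(lzma.compress(original)).decode("ascii")
raw = (
    "Content-Type: text/csv; charset=utf-8\n"
    "Content-Transfer-Filter: x-xz; order=2\n"
    "Content-Transfer-Filter: base64; order=1\n"
    "\n" + encoded + "\n"
)
decoded = apply_filters(message_from_string(raw))
```

The headers are listed in the "wrong" order on purpose: the explicit order parameter is what makes decoding robust against reordering.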
Re: [Email-SIG] API for Header objects
Tony Nelson writes:

> This example seems tortured and contrived.

Not at all. I currently use grep, not the email package, but in fact I extract several headers for use in mailing list moderation. It's getting to the point where my gradually accreting shell script doesn't cut it (more because I'm recruiting additional moderators than because I'm not happy with it), and if I'm going to do this in Python I definitely want an obvious and elegant way to produce a displayable string (ie, Unicode) because not all of the messages I get in Chinese and Korean are spam.

> Custom code to extract a single header one time to send to someone?

That is precisely why we want a simple readable short elegant API. Like str(msg['To']). This also suggests the sequence interface of msg['To'] should not contain tuples of strings, but rather NameAddr objects (taken from the RFC 5322 grammar). Then to flatten a NameAddr, use str or bytes as appropriate. So to present a list of addressees in a moderation interface, you could use

    recips = list(msg['To']) + list(msg['Cc'])
    # We have a utf-8 codec on stdout, between us and the wire.
    print("<ul>\n")
    for recip in recips:
        print("  <li>")
        print(htmlesc(str(recip)))
        print("</li>\n")
    print("</ul>\n")

Of course for wire protocol, you just use bytes instead of str. Hey! that's not bad, even if I do say so myself.

> Just hit reply and trim it yourself.

That won't work, for several reasons.

> If you must, you can use .get_header('X-Spam-Evidence').flatten(). I doubt that anyone would actually do that, outside of a debugging session.

<sigh /> I do it.

> No. This is important, and you will not understand RFC x822 email until you understand this: email messages are not character strings. They are byte sequences. This confusion pervades the email package only because in Python before 3.x, bytes were represented as strings.

That's a bit generous and ungenerous at the same time.
The people who worked on email were trying to come up with a reasonable interface that on the one side treated wire format as bytes (Python 1.x, 2.x str) and display format as text (Python 1.x str, oops, Python 2.x unicode). They failed, unfortunately, but not really because the tools were unavailable. They just treated the difficulties with insufficient respect. On the other hand, these difficulties are inherent in the medium. People (by which I mean nobody participating in this thread) think of email as text. MTAs think of email as octet sequences. Developers (especially Americans) have been sloppy about that distinction for *five* decades, and because until 2000 at least email was the sine qua non of networking, backward compatibility has long demanded incorporating all those mistakes in current practice. And now you're doing the same thing. Email messages have at *least* four ways of manifesting in our world that email-sig needs to worry about: as byte sequences on the wire, as (mostly, anyway, and certainly the headers) texts in our MUAs, as whatever-they-really-are, and as the internal representation of the email package. So depending on which side of the argument you feel like taking, you insist (inconsistently) that an email is a byte string or a header is not a string at all, it's a structured thingie. But it's not that easy. What we need to do is come up with an API that respects all of those aspects *simultaneously*, and allows us to elegantly but accurately change the perspective we use to view this whatever-it-really-is. No, email is not text. Email message bodies and some header fields may represent text. An email message is a byte sequence. One really needs to understand this in order to work with email at a low level. Hm. And here I was hoping that the email package would *implement* the low level, leaving me free to think about high-level things. When one does not understand, then the email package should lead the user in the right direction. 
No, thank you. Python is a double-opt-in language. We're all consenting adults here. Programmers who don't understand the RFCs are likely to be surprised in many places, but they asked for it, they got it.
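For comparison, the moderation listing discussed in this message can be written against the email package that eventually shipped in Python 3. This sketch assumes policy.default, whose address headers expose an addresses attribute; the sample addresses are invented, and html.escape stands in for the htmlesc helper used in the message.

```python
import html
from email import message_from_string, policy

RAW = (
    "To: Alice Example <alice@example.org>, bob@example.org\n"
    "Cc: Carol <carol@example.org>\n"
    "\n"
    "body\n"
)
msg = message_from_string(RAW, policy=policy.default)

# Each element is an Address object; str() gives the displayable form.
recips = list(msg["To"].addresses) + list(msg["Cc"].addresses)

items = ["<li>" + html.escape(str(recip)) + "</li>" for recip in recips]
listing = "\n".join(["<ul>"] + items + ["</ul>"])
print(listing)
```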
Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]
R. David Murray writes:

> Note that while I want to be able to do str(someHeader) to get a string representation of a header body, I'm not so enamored of being able to do
>
>     message['From'] = 'John Smith j...@foo.com'
>
> and have it get turned into a Header or AddressHeader object. Frankly, that looks too magical to me.

> +1

Well, that would make it easy to write scripts that parse lists of addresses and do things with them. Eg, a mailing list manager's mass subscribe interface. That would be nice ... but on reflection it's clear that we would want that to be parsed *strictly*. So it raises exceptions, which must be caught and handled, etc etc. In other words, it's actually not so easy to write scripts, no matter what you do, and you also want to be able to specify what kind of magical fixups (the ever-popular display-name with unquoted period immediately comes to mind as one example) are acceptable, and which are not, not to mention encoding for non-ASCII text.

How about unstructured header bodies, like Subject? Should we allow it, for convenience, or not, for consistency? How about unknown fields, eg X-Are-We-Not-Structured-No-We-Are-Devo? I think, in the first draft, we should be *consistent* in both cases.
Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]
Tony Nelson writes: No. The useful data for an address field is *properly* a list of pairs of friendly name, address -- you should read RFC 5322 section 3.4. The fact that you think I didn't suggests there's really no point in continuing to talk to you. But I'll give it another try. The issues we are dealing with at this point really have very little to do with accurate implementation of the RFCs. We all know that's necessary, but ... it's a Simple Matter Of Programming. At least, that's why Postel, Crocker, et al put so much effort into writing the RFCs, so it would be a SMOP. I think they did a pretty good job. I agree with you that we should make it relatively difficult to put things that *don't* conform to the RFCs on the wire. But that should be the responsibility of the middleware that talks to the file system and to the MTA. I see no reason *at this stage* to burden MUA (in the general sense) developers with all the RFC rules, and MDA/MTA writers should only need to worry about it for error handling (__bytes__() should normally do the job for them). (For values of should equivalent to in my dreams, I do fear.) This makes it very important that the easy way of doing things be the correct way. With Address fields, that way is Nonsense. You are ignoring the fact that *people* (ie, nobody participating in this threadwink) read an address field *as text*, and they type in addresses *as text*. We do not extract and inject this information as pickles of Header objects via Firewire sockets implanted in their skulls. There is *no /unique/ correct way* here. For example (this is a trick question), in your opinion, what should msg['To'][0] return if the original header was To: Stephen J. Turnbull step...@xemacs.org ? ('Stephen J. Turnbull', 'step...@xemacs.org') You must be very confused to think this is a trick question. Try it with the current email package's email.utils.parseaddr(). Again, see RFC5322 section 3.4. 
But section 3.4 is not relevant to the trickiness, and parseaddr is not strictly conforming. See the definitions of name-addr, display-name, phrase, word, atom, and atext in sections 3.2.3, 3.2.5, and 3.4 of the RFC you cite. Also see the definition of special. Finally, I commend to your attention the definition of obs-phrase in section 4.1, and the *very* special nature of this particular gotcha as described there. The point is that by parsing that and claiming it's an RFC 5322 section 3.4 name-addr, you have invoked the rather magical Postel Principle. You either have to say for my purpose I want magic in the API (which you previously denied), or you have to admit that this is harder than it looks. It is true that section 4.1 says that the obsolete (interpreting) syntax must be accepted *off the wire*. So there certainly is a justification for having a short obvious elegant spelling for make an address Header into a sequence. But IMHO that spelling should be list(msg['To']), not msg['To']. The rationale is that---assuming it can be implemented---several of us would like to be able to spell wire format as bytes(msg['To']) and display format as str(msg['To']). I bet there are other uses that would be well-served by such indirection. And I would be disappointed if we can't do way better than msg.get_header('To').flatten() to get bytes---or should that be string?---out. Internally, the Header whose .useful attribute is returned by msg['foo'] will contain parsed data, referring to parsed tokens. Flattening those parsed tokens will produce the original data. Not a problem at all, simple to implement, in the most direct way. 
> And horrid to use, if you mean that the internal representation will be a full parse tree according to the augmented BNF in RFCs 822, 2822, 5322, 2045-2049, etc etc., and that the only other way to access that data is via an arbitrarily defined .useful attribute (which, BTW, is quite unpythonic if you intend for it to be available as msg['foo'] as well: TOOWTDI).

> You put words in my mouth.

Of course I don't put words in your mouth. The phrase "if you mean that" clearly indicates that what follows is *my* understanding of the implications of what you wrote. I think that interpretation is quite justifiable based on your insistence that the OOWTDI be your sequence of (address, display-name) pairs.

> Why assume that I am incompetent, or a fool?

I don't assume any such thing. But I become less and less trustful of your goodwill toward requirements other than your own.

> Of course the internal representation would include the full parse tree. Of course the external interface would provide read and write access to the relevant data.

Note that I didn't say it wouldn't. I said it *would*. But I think it's justified, by what you have written so far, to expect that it would be an inconvenient interface (maybe even horridly so).

> The .useful attribute (need
Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]
R. David Murray writes:

> put Header objects into it. I don't think the overhead of having to do
>
>     message['Subject'] = Header('subject string')

Hm. Should a Header know which header it is? Ie, should that be

    message['Subject'] = Header('subject', 'subject string')

? (I assume you would be less than in love with having the assignment magically stuffing Subject into the Header as it gets assigned.)
Re: [Email-SIG] API for Header objects [was: Dropping bytes support in json]
Tony Nelson writes:

> Assuming that by Destination you mean a class of Address header fields, as there is no Destination: header field, such header fields contain addresses, which can be considered to contain (as the email package does) a list of (name, email address) pairs, or, at a lower level, to also have Comments, there is indeed only one correct choice, which is the one the email package currently provides the diligent user. I wish it to be the one obvious choice, so that less study is needed to properly use the email package.

As you point out above, display names and comments are different. It's *not* obvious to me that they should be confounded by default.

In any case, it would certainly be possible to implement both the indexing feature, so that msg['To'][0] returns a (display, mailbox) tuple, and a converter so that list(msg['to']) returns a list of such tuples (in both cases, assuming that most users prefer not to distinguish comments from display names).
Re: [Email-SIG] [Python-Dev] Dropping bytes support in json
Shouldn't this thread move lock, stock, and .signature to email-sig?

Barry Warsaw writes:

> > > It does seem to make sense to think about headers as text header names and text header values.

> > I disagree. IMHO, structured header types should have object values, and something like

> While I agree, there's still a need for a higher level API that makes it easy to do the simple things.

Sure. I'm suggesting that the way to determine whether something is simple or not is by whether it falls out naturally from correct structure. Ie, no operations that only a Cirque du Soleil juggler can perform are allowed.

> I agree that the Message class needs to be strict. A parser needs to be lenient;

Not always. The Postel Principle only applies to stuph coming in off the wire. But we're *also* going to be parsing pseudo-email components that are being handed to us by applications (eg, the perennial control-character-in-the-unremovable-address Mailman bug). Our parser should Just Say No to that crap.

> see the .defects attribute introduced in the current email package. Oh, and this reminds me that we still haven't talked about idempotency. That's an important principle in the current email package, but do we need to give up on that?

Idempotency? I'm not sure what that means in the context of the email package ... multiplication by zero? <wink> Do you mean that .parse().to_wire() should be idempotent? Yes, I think that's a good idea, and it shouldn't be too hard to implement by (optionally?) caching the whole original message or individual components (headers with all whitespace including folding cached verbatim, etc). I think caching has to be done, since stuff like "did the original fold with a leading tab or a leading space, and at what column" and so on seems kind of pointless to encode as attributes on Header objects.

[Description of MessageTextView and MessageWireView elided.]

> This seems similar to Glyph's basic idea, but with a different spelling.

Yes. I don't much care which way it's done, and Glyph's style of spelling is more explicit. But I was thinking in terms of the number of people who are surely going to sing "Mama don' 'low no Unicodes roun' here" and squeal "codec WTF?! outta mah face, man!"

___ Email-SIG mailing list Email-SIG@python.org Your options: http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
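The caching approach to idempotency proposed in the message above (keep the verbatim wire form of each header, folding included, and only regenerate it when the value changes) could be sketched roughly as follows. This is an illustration only: `CachedHeader`, `parse()` and `to_wire()` are hypothetical names, not part of the email package, and the unfolding is deliberately naive.

```python
class CachedHeader:
    """Hypothetical header that keeps its verbatim wire form next to its value."""

    def __init__(self, name, value, raw=None):
        self.name = name
        self._value = value
        self._raw = raw  # original bytes, folding included, or None

    @classmethod
    def parse(cls, raw):
        """Split b'Name: body' and unfold it, but remember the exact input."""
        name, _, body = raw.partition(b":")
        # Naive unfolding: drop the CRLFs, keep the continuation whitespace.
        unfolded = body.decode("ascii").replace("\r\n", "")
        return cls(name.decode("ascii"), unfolded.strip(), raw=raw)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        self._raw = None  # the cached bytes no longer describe this header

    def to_wire(self):
        if self._raw is not None:
            return self._raw  # untouched header: emit the input byte for byte
        # Modified header: regenerate (no refolding in this sketch).
        return f"{self.name}: {self._value}\r\n".encode("ascii")


raw = b"Subject: hello\r\n\tworld\r\n"
header = CachedHeader.parse(raw)
assert header.to_wire() == raw  # tab-folded input survives the round trip
header.value = "changed"
assert header.to_wire() == b"Subject: changed\r\n"
```

The point of the design is that details such as "tab or space, and at what column" never become attributes of the header object; they live only in the cached bytes and disappear as soon as the header is edited.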
Re: [Email-SIG] [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)
Bill Janssen writes:

> Barry Warsaw <ba...@python.org> wrote:
> > In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that.
>
> Yep.

Uh, I hate to rain on a parade, but isn't that how we arrived at the *current* email package?
Re: [Email-SIG] [Python-Dev] Dropping bytes support in json
Barry Warsaw writes:

> There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

> Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into strings for text/* types and bytes for anything else (not counting multiparts).

*sigh* Why are you back-tracking? The payload should be of an appropriate *object* type. Atomic object types will have their content stored as string or bytes [nb I use Python 3 terminology throughout]. Composite types (multipart/*) won't need string or bytes attributes AFAICS. Start by implementing the application/octet-stream and text/plain;charset=utf-8 object types, of course.

> It does seem to make sense to think about headers as text header names and text header values.

I disagree. IMHO, structured header types should have object values, and something like

    message['to'] = "Barry 'da FLUFL' Warsaw <ba...@python.org>"

should be smart enough to detect that it's a string and attempt to (flexibly) parse it into a fullname and a mailbox, adding escapes, etc. Whether these should be structured objects or they can be strings or bytes, I'm not sure (probably bytes, not strings, though -- see next example). OTOH

    message['to'] = b'''Barry 'da.FLUFL' Warsaw <ba...@python.org>'''

should assume that the client knows what they are doing, and should parse it strictly (and I mean be a real bastard, eg, raise an exception on any non-ASCII octet), merely dividing it into fullname and mailbox, and caching the bytes for later insertion in a wire-format message.

> In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that makes a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with message['to'] always returning a structured_header object (or maybe even more specifically an address_header object), and methods like message['to'].build_header_as_text() which returns

    "To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

and message['to'].build_header_in_wire_format() which returns

    b"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

Then have email.textview.Message and email.wireview.Message which provide a simple interface where message['to'] would invoke .build_header_as_text() and .build_header_in_wire_format() respectively.

> Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,
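To make the proposal in the message above concrete, here is one way the str-versus-bytes assignment rule and the two "views" might look. Everything in it is an assumption made for illustration: `AddressHeader`, `TextViewMessage` and `WireViewMessage` are hypothetical stand-ins for the address_header, email.textview.Message and email.wireview.Message named in the post, and the example.org addresses are invented. Only `email.utils.parseaddr()` and `email.utils.formataddr()` are real library calls.

```python
from email.utils import formataddr, parseaddr


class AddressHeader:
    """Hypothetical structured header holding one (fullname, mailbox) pair."""

    def __init__(self, name, value):
        self.name = name
        self._wire_cache = None
        if isinstance(value, bytes):
            # Strict path: any non-ASCII octet raises UnicodeDecodeError,
            # and the caller's bytes are cached for the wire format.
            text = value.decode("ascii")
            self._wire_cache = value
        else:
            # Lenient path: a str is parsed flexibly.
            text = value
        self.fullname, self.mailbox = parseaddr(text)
        if not self.mailbox:
            raise ValueError(f"cannot parse address from {value!r}")

    def build_header_as_text(self):
        if self.fullname:
            return f"{self.name}: {self.fullname} <{self.mailbox}>"
        return f"{self.name}: {self.mailbox}"

    def build_header_in_wire_format(self):
        if self._wire_cache is not None:
            body = self._wire_cache
        else:
            # formataddr() RFC 2047-encodes a non-ASCII display name.
            body = formataddr((self.fullname, self.mailbox)).encode("ascii")
        return self.name.encode("ascii") + b": " + body


class TextViewMessage:
    """Hypothetical view: message['to'] yields the header as text (str)."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        self._headers[name.lower()] = AddressHeader(name, value)

    def __getitem__(self, name):
        return self._headers[name.lower()].build_header_as_text()


class WireViewMessage(TextViewMessage):
    """Hypothetical view: message['to'] yields the header as wire bytes."""

    def __getitem__(self, name):
        return self._headers[name.lower()].build_header_in_wire_format()


text_msg = TextViewMessage()
text_msg["To"] = "Jürgen Böing <jb@example.org>"
print(text_msg["to"])

wire_msg = WireViewMessage()
wire_msg["To"] = b"Barry Warsaw <barry@example.org>"
print(wire_msg["to"])
```

Both views share the same structured object underneath; the only difference is which build method `message['to']` invokes, which is the "simple interface" the post asks for.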
Re: [Email-SIG] email.header.decode_header eats my spaces
Barry Warsaw writes:

> Steve writes:
> > IMHO, the Header class should be abstract, and there should be subclasses that handle dates, lists of addresses, lists of message-ids, etc.
>
> I'm not sure inheritance is the right way to organize this.

I picked inheritance because I see the header type as being fixed at Header instantiation (I can't think of a use-case for changing a From header to a Subject header, while Message-ID and Resent-Message-ID would be handled by the same class), but there are some things (handling folding, parsing the field name and body) that are common to all headers. I would be happy with any scheme that has the property that, given a field name, the semantics of its contents are fixed according to the field if it is registered, or treated as *text with caution (maybe extra warnings? etc) if the field is not registered.

> Or, maybe inheritance is right. In any case, I think you also want to have a registry of some sort

Indeed I do!
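A registry of the kind both posters agree on might look something like the sketch below: the field name selects a fixed header class, Message-ID and Resent-Message-ID share one class, and an unregistered field falls back to unstructured text with a warning. The class names, `HEADER_REGISTRY` and `make_header()` are hypothetical; `email.utils.getaddresses()` is the only library call used.

```python
import warnings
from email.utils import getaddresses


class Header:
    """Behaviour common to all headers: the field name and the unfolded body."""

    def __init__(self, name, body):
        self.name = name
        self.body = body


class UnstructuredHeader(Header):
    """Body is plain *text with no further structure."""


class AddressListHeader(Header):
    @property
    def addresses(self):
        # List of (fullname, mailbox) pairs.
        return getaddresses([self.body])


class MessageIDHeader(Header):
    @property
    def msgid(self):
        return self.body.strip()


# The field name fixes the semantics of the body at instantiation time.
HEADER_REGISTRY = {
    "from": AddressListHeader,
    "to": AddressListHeader,
    "cc": AddressListHeader,
    "message-id": MessageIDHeader,
    "resent-message-id": MessageIDHeader,
    "subject": UnstructuredHeader,
}


def make_header(name, body):
    """Instantiate the registered class, or fall back to text with a warning."""
    cls = HEADER_REGISTRY.get(name.lower())
    if cls is None:
        warnings.warn(f"unregistered field {name!r}; treating body as text",
                      stacklevel=2)
        cls = UnstructuredHeader
    return cls(name, body)


to = make_header("To", "a@example.org, B <b@example.org>")
print(type(to).__name__, to.addresses)
```

Inheritance here carries only the shared parts (name and body handling); whether the subclasses or a separate parser object should own the field-specific logic is exactly the design question left open in the thread.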