On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> On approximately 1/25/2010 12:10 PM, came the following characters from
> the keyboard of R. David Murray:
> > So, those are my thoughts, and I'm sure I haven't thought of all the
> > corner cases. The biggest question is, does it seem like this general
> > scheme is worth pursuing?
>
> If it was stated, I missed it: is from_full_header a way of producing
> an object from a raw data value? Whereas __init__ would obviously be
Yes.

> used to produce one from string or bytes values. If so, then it would

Well, StringHeader.from_full_header would take a string as input, while
BytesHeader.from_full_header would take bytes as input. __init__ would
be used to construct a header in your program:

    StringHeader('MyHeader', 'my value')
    BytesHeader(b'MyHeader', b'my value')

> be a requirement that this from_full_header API would never produce an
> exception? Rather it would produce an object with or without defects?

Yes.

> Are there any other *Header APIs that would be required not to produce
> exceptions? I don't yet perceive any.

I don't think so. from_full_header is the only one involved in parsing
raw data. Whether __init__ throws errors or records defects is an open
question, but I lean toward it throwing errors. The question is open
because an email-manipulating application may want to convert to text
to process an incoming message, and there are things that a BytesHeader
can hold that would cause errors when decoded to a StringHeader
(specifically, 8 bit bytes that aren't transfer encoded). So it may be
that decode, at least, should not throw errors but instead record
additional defects in the resulting StringHeader. I think that even in
that case __init__ should still throw errors, though; decode could deal
with the defects before calling StringHeader.__init__, or (more likely)
catch the errors thrown by __init__, fix/record the defects, and call
it again.

Note, by the way, that by 'raw data' I mean what you are feeding in.
Raw data fed to a BytesHeader would be bytes, but raw data fed to a
StringHeader would be text (eg: if read from a file in text mode).

> The "charset" parameter... is that not mostly needed for data parts?

No, if you start with a unicode string in a StringHeader, you need to
know what charset to encode the unicode to and therefore to specify as
the charset in the RFC 2047 encoded words.

> Headers are either ASCII, or contain self-describing charset info.
That's true for BytesHeaders, but not for StringHeaders. So, as I said
above, charset for StringHeader says which charset to put into the
encoded words when converting to BytesHeader form.

I specified a charset parameter for 'decode' only to handle the case of
raw bytes data that contains 8 bit data that is not in encoded words
(ie: is not RFC compliant). I am visualizing this as satisfying a use
case where you have non-email (non RFC compliant) data where you allow
8 bit data in the header bodies because it's an internal app and you
know the encoding. You can then use decode(charset) to decode those
BytesHeaders into StringHeaders.

> I guess I could see an intermediate decode from string to some charset,
> before serialization, as a hint that when generating headers, that all
> the characters in the header that are not ASCII are in the specified
> charset... and that that charset is the one to be used in the
> self-describing serialized ASCII stream? The full generality of the

Exactly.

> RFCs, however,
> allows pieces of headers to be encoded using different charsets... with
> this API, it would seem that that could only be created containing one
> charset... the serialization primitives were made available, so that
> piecewise construction of a header value could be done with different
> charsets, and then the from_full_header API used to create the complex
> value. I don't see this as a severe limitation, I just want to
> understand your intention, and document the limitation, or my
> misunderstanding.

Right. I'm visualizing the "normal case" being encoding a StringHeader
using the default utf-8 charset or another specified charset, turning
the words containing non-ASCII characters into encoded words using that
charset.
The utility methods that turn unicode into encoded words would be
exposed, and an application that needs to create a header with mixed
charsets can use those utilities to build RFC compliant bytes data and
pass that to one of the BytesHeader constructors. (Make the common case
easy, and the complicated cases possible.)

> > BytesHeader would be exactly the same, with the exception of the signature
> > for serialize and the fact that it has a 'decode' method rather than an
> > 'encode' method. Serialize would be different only in the fact that
> > it would have an additional keyword parameter, must_be_7bit=True.
>
> I am not clear on why StringHeader's serialize would not need the
> must_be_7bit parameter... or do I misunderstand that
> StringHeader.serialize produces wire-format data?

The latter. StringHeader serialize does not produce wire-format data, it
produces text (for example, for display to the user). If you want wire
format, you encode the StringHeader and use the resulting BytesHeader's
serialize.

> > The magic of this approach is in those encode/decode methods.
> >
> > Encoding a StringHeader would yield a BytesHeader containing the same
> > data, but encoded per RFC2047 using the specified charset. Decoding a
> > BytesHeader would yield a StringHeader with the same data, but decoded to
> > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
> > not the RFC2047 sense) using the specified charset (which would default to
> > ASCII, meaning bare 8bit bytes in headers would throw an error). (What to
> > do with RFC2047 charsets like unknown-8bit is an open question...probably
> > throw an error).
> >
>
> Would the encoding to/from StringHeader/BytesHeader preserve the
> from_full_header state and value?

My thought is no. Once you encode/decode the header, your program has
transformed it, and I think it is better to treat the original raw data
as gone.
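To make the encode/decode round trip concrete, here is a minimal sketch
of how the two classes might fit together. It is only an illustration
under stated assumptions: the class bodies are hypothetical (the API
above is a proposal, not existing code), the stdlib email.header helpers
stand in for the RFC 2047 step, the whole value is encoded rather than
word by word, and the proposed must_be_7bit keyword is left out.

```python
from email.header import Header, decode_header, make_header


class StringHeader:
    """Hypothetical text-side header; names follow the proposal."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        # Would only be set by from_full_header (omitted in this sketch).
        self.raw_header = None

    def serialize(self):
        # Text for display, not wire format.
        return f"{self.name}: {self.value}"

    def encode(self, charset="utf-8"):
        # Assumption: encode the whole value when it has non-ASCII text.
        if self.value.isascii():
            wire = self.value
        else:
            wire = Header(self.value, charset).encode()
        # The regular constructor is used, so raw_header is not carried over.
        return BytesHeader(self.name.encode("ascii"), wire.encode("ascii"))


class BytesHeader:
    """Hypothetical wire-side header (bytes name and value)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.raw_header = None

    def serialize(self):
        # Wire format; the proposed must_be_7bit keyword is not modelled.
        return self.name + b": " + self.value

    def decode(self, charset="ascii"):
        # `charset` only matters for bare 8 bit bytes; with the ASCII
        # default such bytes raise UnicodeDecodeError.
        text = self.value.decode(charset)
        value = str(make_header(decode_header(text)))
        return StringHeader(self.name.decode("ascii"), value)


subject = StringHeader("Subject", "café au lait")
wire = subject.encode()          # BytesHeader holding an encoded word
print(wire.serialize())
print(wire.decode().serialize())
```

In this sketch both encode and decode return fresh objects whose
raw_header is None, which matches the idea that the original raw data is
treated as gone once the header has been transformed.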
The motivation for this is that the 'raw data' of a StringHeader is the
*text* string used to create it. Keeping a bytes string 'raw data'
around as well would get us back into the mess that I developed this
approach to avoid, where we'd need to specify carefully the difference
between handling a header whose 'original' raw data was bytes vs string,
for each of the BytesHeader and StringHeader cases.

Better, I think, to put the (small) burden on the application
programmer: if you want to preserve the original input data, do so by
keeping the original object around. Once you mutate the object model,
the original raw data for the mutated piece is gone.

There are some use-case questions here, though, with regard to
preservation of as much original information/format as possible, and how
valuable that is. I think we'll have to figure that out by examining
concrete use cases in detail. (It is not something that the current
email package supports very well, by the way...headers currently get
modified significantly in the parse/generate cycle, even without
bytes-to-string transformations happening.)

> > (Encoding or decoding a Message would cause the Message to recursively
> > encode or decode its subparts. This means you are making a complete
> > new copy of the Message in memory. If you don't want to do that you
> > can walk the Message and convert it piece by piece (we could provide a
> > generator that does this).)
>
> Walking it piece by piece would allow the old pieces to be discarded, to
> save total memory consumption, where that is appropriate.
>
> Perhaps one generator that would be commonly used, would be to convert
> headers only, and leave MIME data parts alone, accessing and converting
> them only with the registered methods? This would mean that a "complete
> copy" wouldn't generally be very big, if the data parts were excluded
> from implicit conversion.
> Perhaps the "external storage protocol" might
> also only be defined for MIME data parts, and walking the tree with this
> generator would not need to reference the MIME data parts, nor bring
> them in from "external storage".

That's true. The Bytes and String versions of binary MIME parts, which
are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object. That breaking of the expectation that 'encode' and 'decode'
return new objects (in analogy to how encode and decode of strings/bytes
works) might not be a good thing, though.

In any case, text MIME parts have the same bytes vs string issues as
headers do, and should, IMO, be converted from one to the other on
encode/decode.

Another possible approach would be some sort of 'encode/decode on
demand' system, although that would need to retain a pointer to the
original object, which might get us into suboptimal reference cycle
difficulties.

These bits are implementation details, though, and don't affect the API
design.

> > raw_header would be the data passed in to the constructor if
> > from_full_header is used, and None otherwise. If encode/decode call
> > the regular constructor, then this attribute would also act as a flag
> > as to whether or not the header was constructed from raw input data
> > or via program.
> >
>
> This _implies_ that from_full_header always accepts raw data bytes...
> even for the StringHeader. And that implies the need for an implicit
> decode, and therefore, perhaps a charset parameter? No, not a charset
> parameter, since they are explicitly contained in the header values.

Your confusion comes from my confusing use of the term 'raw data' to
mean whatever was input to the from_full_header constructor, which is
bytes for a BytesHeader and text for a StringHeader.

> Decode for header values may not need a charset value at all!

Normally it would not. charset would be useful in decode only for
non-RFC compliant headers.
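The split between __init__ and from_full_header discussed in this
thread could look roughly like the sketch below. It is a guess at one
possible shape, not a settled design: the validation rule, the defect
messages and the plain list used for defects are all hypothetical, and
only the text-side StringHeader is shown.

```python
class StringHeader:
    """Hypothetical sketch: strict constructor, lenient parser."""

    def __init__(self, name, value):
        # Program-side construction: bad input raises (assumed rule).
        if not name or ":" in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid header name: {name!r}")
        self.name = name
        self.value = value
        self.raw_header = None   # None means "built by the program"
        self.defects = []

    @classmethod
    def from_full_header(cls, text):
        # Parsing raw input (text, for a StringHeader): never raise,
        # record defects on the resulting object instead.
        defects = []
        name, sep, value = text.partition(":")
        value = value.strip()
        if not sep:
            defects.append("missing ':' separator")
        try:
            header = cls(name, value)
        except ValueError as exc:
            # Catch the constructor's error, record it, build anyway.
            defects.append(str(exc))
            header = cls.__new__(cls)
            header.name, header.value = name, value
        header.raw_header = text
        header.defects = defects
        return header


good = StringHeader.from_full_header("Subject: hello")
bad = StringHeader.from_full_header("no colon here")
print(good.name, good.value, good.defects)
print(bad.defects)
```

Here raw_header doubles as the flag described above: it holds the input
text when the header came from from_full_header and stays None when the
header was created through the regular constructor.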
> No comments for the rest.

Thanks for your feedback.

--David
_______________________________________________
Email-SIG mailing list
Email-SIG@python.org