On approximately 10/8/2009 6:00 AM, came the following characters from the keyboard of Barry Warsaw:
On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:
The application options are to drop the attachment, or pass through the corrupted bytes, and let the next application try to make sense of it.

Exactly, and it's not for the email package to say which is right.

Here's a use case: I've got a Message that was parsed from wire input and I want to mangle the Subject heading to add the list prefix. I know exactly what charset the prefix is in because that's data I control. When I ask for the original Subject value, I'm handed an instance that I can use to try to figure out how to add the prefix.

First thing I'll ask it is "are you a single chunk in my prefix charset (or compatible)?" If so, I can probably just prepend my prefix onto the value. If not, "are you composed of multiple valid chunks in different charsets?" If so, I know that I need to encode my prefix, but I can still prepend it to the header value (hopefully using the same API, and I don't care that the implementation could not use string concatenation).

If not, then what? Maybe I don't care if some of the chunk charsets aren't known because I can still use the right encode+prepend strategy. But if the header is a gobbledegook of 8-bit bytes? I'm pretty sure I want to be able to ask the API if that's the case rather than get an exception. The thing I'm not so sure about is what happens if my application is naive enough to just ask for the header as unicode and that conversion can't be made. I /think/ it should raise an exception in that case. But then when I ask for the header value as a mass of bytes, that should succeed and return me the raw input.
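As a rough illustration only, the questions in this use case could be answered by a helper along the following lines. The function name classify_header and its return labels are hypothetical and not part of the email package; the sketch only relies on the existing email.header.decode_header function and assumes the caller has the raw header value as bytes.

```python
import codecs
from email.errors import HeaderParseError
from email.header import decode_header


def classify_header(raw: bytes) -> str:
    """Hypothetical helper: describe a raw header value without raising."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return "raw-8bit"              # undeclared 8-bit bytes on the wire
    try:
        chunks = decode_header(text)
    except HeaderParseError:
        return "malformed"             # e.g. broken base64 in an encoded-word
    charsets = []
    for payload, charset in chunks:
        if charset is None:
            continue                   # plain ASCII text between encoded-words
        try:
            codecs.lookup(charset)     # is the charset known to this Python?
            payload.decode(charset)    # do the bytes really decode in it?
        except (LookupError, UnicodeDecodeError):
            return "undecodable-chunk"
        charsets.append(charset)
    if not charsets:
        return "ascii"
    if len(set(charsets)) == 1:
        return "single-charset"
    return "multiple-charsets"
```

The point of the sketch is that every answer is a return value, so the application decides what to do with a bad header instead of catching an exception.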

So for this use case, it is known that all headers are ASCII. So the operation of prepending a list prefix should not care whether the Subject: value is valid or not... it can simply prepend the list prefix, followed by SP, to the existing, raw header that already exists.

The only remaining issue is line length limits, so maybe it has to use CR LF TAB instead of space, sometimes.

OK, so if the prefix is not ASCII, it gets separately encoded, including a trailing SP, and then prepended to the value followed by SP or CR LF TAB depending on the line length limit.

So to prepend into a text header, you shouldn't need to decode the undecodable... there should be a prepend (and possibly also an append) operation provided by the API, so that applications can tweak headers without decoding. This allows useful behavior even if new methods of encoding are invented that are not yet understood by a particular version of the email library.
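A minimal sketch of such a prepend operation, working purely on the raw bytes, might look like this. The name prepend_prefix, its parameters, and the choice of base64 for the encoded-word are assumptions made for illustration; nothing here is an existing email package API.

```python
import base64

MAX_LINE = 78  # recommended line length from RFC 5322, not counting CR LF


def prepend_prefix(raw_value: bytes, prefix: str,
                   charset: str = "utf-8", name: str = "Subject") -> bytes:
    """Hypothetical helper: prepend a list prefix without decoding raw_value."""
    try:
        token = prefix.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII prefix: encode it separately as one RFC 2047 encoded-word.
        # The trailing SP goes inside the encoded-word, because whitespace
        # between two adjacent encoded-words is dropped when decoding.
        # Assumption: the prefix is short enough to fit the 75-character
        # limit on a single encoded-word; longer prefixes need splitting.
        payload = base64.b64encode((prefix + " ").encode(charset))
        token = b"=?" + charset.encode("ascii") + b"?b?" + payload + b"?="
    # Only the first physical line of the existing value matters for the limit.
    first_line = raw_value.split(b"\r\n", 1)[0]
    room = MAX_LINE - len(name) - len(": ")
    fits = len(token) + 1 + len(first_line) <= room
    # Join with SP when the line stays short enough, else fold with CR LF TAB.
    return token + (b" " if fits else b"\r\n\t") + raw_value
```

Because the existing value is never decoded, the same call works whether it is plain ASCII, a mix of encoded-words in unknown charsets, or invalid 8-bit bytes.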

Asking for the header value (or whole header) in Unicode should decode the chunks that are understandable and decodable, and leave the chunks that are not understandable as ASCII-converted-to-Unicode-but-still-possibly-weirdly-encoded ... I think that is what the RFCs encourage.
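One way to picture that behavior is the sketch below, which decodes each encoded-word it understands and leaves the rest untouched. The function decode_what_we_can and the regular expression are hypothetical; only email.header.decode_header is an existing call. As a simplification, the sketch does not remove the whitespace between adjacent encoded-words the way a full RFC 2047 decoder would.

```python
import re
from email.errors import HeaderParseError
from email.header import decode_header

# Simplified pattern for a single RFC 2047 encoded-word (an assumption,
# looser than the grammar in the RFC).
_ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def decode_what_we_can(value: str) -> str:
    """Hypothetical helper: decode the decodable chunks, keep the others."""
    def replace(match):
        word = match.group(0)
        try:
            (payload, charset), = decode_header(word)
            if charset is None:
                return word            # not recognized as an encoded-word
            return payload.decode(charset)
        except (HeaderParseError, LookupError, UnicodeDecodeError, ValueError):
            # Unknown charset, bad bytes, or broken encoding: pass the
            # ASCII form of the chunk through unchanged.
            return word
    return _ENCODED_WORD.sub(replace, value)
```

A chunk in a charset that this version of the library does not know simply survives as its ASCII encoded-word text, so a later or smarter application still has a chance to decode it.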

Asking for a header as bytes should return the wire data, if it is available, or an encoding of real data as wire data (like generate would do). There is no Unicode that cannot be encoded to wire format, IIUC, usually via a variety of heuristics once non-ASCII characters are included, that may produce a variety of differing results, all of which should decode back to the original data.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

_______________________________________________
Email-SIG mailing list
