On approximately 10/9/2009 6:25 PM, came the following characters from the keyboard of R. David Murray:
On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote:
On approximately 10/9/2009 4:20 PM, came the following characters from the keyboard of R. David Murray:
 On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
> On approximately 10/9/2009 8:10 AM, came the following characters from
> the keyboard of Stephen J. Turnbull:
> >   Glenn Linderman writes:
> > > > > produce a defect report, but then simply converted to Unicode
> > > > > as if it were Latin-1 (since there is no other knowledge
> > > > > available that could produce a better conversion).
> > > >
> > > > No, that is already corruption. Most clients will assume that
> > > > string is valid as a header, because it's valid as a string.
> > >
> > > Sure it is corruption. That's why there is a defect report. But
> > > the conversion technique is appropriate, per the Postel principle.
> >
> > Actually, I would say you are emitting leniently, in violation of the
> > Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to
> look at it, and I'm talking about ways the client can look at
> not-quite-perfect data, knowing that it is not quite perfect, but still
> being able to see it. I'm not at all talking about emitting data. You
> seem to be calling the email package helping the client to accept
> not-quite-perfect data, as a form of emitting data. It is not.

IMO, the appropriate way for the email package to provide the API you
are talking about is to provide the client with a way to get at the raw
byte string, which I think everyone agrees on.  If the client wants to
decode it as if it were latin-1 to process it, it can then do that.
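
As a rough illustration of the "decode the raw byte string as if it were latin-1" option, here is a minimal sketch. The header bytes are made up for the example, and nothing here is part of the email package's API; it only shows why Latin-1 is a fallback that cannot fail and can be reversed.

```python
# Hypothetical raw header value containing an unencoded non-ASCII byte (0xE9).
raw_value = b"Caf\xe9 menu <cafe@example.com>"

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this decode cannot raise, whatever the bytes are.
text = raw_value.decode("latin-1")

# The mapping is one-to-one, so the original bytes can be recovered exactly.
round_tripped = text.encode("latin-1")

# A strict ASCII decode of the same bytes fails, which is the "defect" case.
try:
    raw_value.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Whether the resulting string is the text the sender intended is a separate question; the decode only guarantees that no bytes are lost.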

That certainly works, but it isn't very helpful... that forces the client application to reproduce the logic to parse the header value and decode the parts that can be decoded successfully, and that is exactly the sort of thing Stephen was complaining about when he thought I was suggesting that to be a requirement (but he was confused about what I was suggesting).

I wasn't clear, sorry :). The current API has a 'decode_header' function,
which doesn't do the byte-to-unicode decode (yeah, there's another naming
problem here...we have two types of decoding and only one word for both)
but instead returns (bytes, charset) tuples.  This piece of the API is
broken in Python 3, and I don't think it is the right API going forward,
but that _kind_ of API is what I meant by 'getting at the raw byte
string':  the byte string that failed the bytes-to-unicode decoding,
not the entire header (though there will also be a way to get that if
you need it, I presume.)
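
To make the "two types of decoding" concrete, here is a small sketch using `email.header.decode_header` as it exists in the current Python 3 standard library, which may not match the 2009 code under discussion. The `to_text` helper is hypothetical and only stands in for the second, bytes-to-unicode step that `decode_header` leaves to the caller.

```python
from email.header import decode_header

# First kind of decoding: undo the RFC 2047 transfer encoding (Q or B).
# The result is a list of (value, charset) pairs, not final text.
parts = decode_header("=?iso-8859-1?q?p=F6stal?=")


def to_text(value, charset):
    """Second kind of decoding: bytes to str (hypothetical helper)."""
    if isinstance(value, str):
        # Headers without encoded words come back as str already.
        return value
    # May raise LookupError or UnicodeDecodeError for bad charsets or
    # bad bytes; a real API would have to decide what to do then.
    return value.decode(charset or "ascii")


decoded = "".join(to_text(value, charset) for value, charset in parts)
```

The bytes that fail in `to_text` are the "raw byte string" meant above: the piece that failed the bytes-to-unicode step, not the whole header.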

Yeah, that'd be better. Of course, when returning Unicode strings, there would be no particular need to identify the various charsets in which the header was transmitted, except for invertibility and error handling, unless the client wanted to track that for some reason. If the goal is to preserve invertibility, then maybe tuples like (str, charset, defect) would be better... where defect would be None for good data, but if defect were "non-ASCII", then you'd know the str was converted as if it were charset [Latin-1 in my book, but if the email package had rules or the API had parameters for how to deal with non-ASCII stuff, some other charset could be specified, perhaps, but if that fails it might still have to fall back to Latin-1]; if defect were "ASCII", then you'd know that the str looked like an encoded word, but couldn't be decoded because the charset wasn't recognized, or the decoding via that charset failed, so the encoded word was supplied.

Correspondingly, a header value could be set by supplying such a list, even with defect values as described above, to permit invertibility, and passing on what was obtained, so that if there are overriding local conventions (yep, such things used to be used, and maybe still are in some areas), the data would be preserved as best as possible, and so that the email package could support creation of messages according to the local conventions.

I'd hope that a separate tuple would be used for each encoded-word, or, if charset ASCII and defect None, then it would describe a run of ASCII between encoded words. Yes, an encoded word can be encoded in ASCII for rare use (if the input word looks like an encoded word), so that would cause a sequence of charset ASCII, defect None tuples, but otherwise a plain ASCII header value would have a single entry in the list of tuples.
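
The (str, charset, defect) idea above could be sketched roughly as follows. Everything here is hypothetical: `decode_parts`, the helper names, the regular expression, and the choice to report "ascii" as the charset for an undecodable encoded word are assumptions made for illustration, not an existing or proposed email package API. The sketch also skips details such as dropping whitespace between adjacent encoded words.

```python
import base64
import binascii
import quopri
import re
from typing import List, Optional, Tuple

Part = Tuple[str, str, Optional[str]]  # (text, charset, defect)

# Rough RFC 2047 encoded-word pattern: =?charset?B-or-Q?payload?=
# The payload class is printable ASCII except "?".
ENCODED_WORD = re.compile(rb"=\?([A-Za-z0-9._-]+)\?([bBqQ])\?([!->@-~]*)\?=")


def _plain_run(chunk: bytes) -> Part:
    """Text between encoded words: ASCII if possible, else Latin-1 plus defect."""
    try:
        return (chunk.decode("ascii"), "ascii", None)
    except UnicodeDecodeError:
        # Latin-1 cannot fail, so the bytes stay recoverable.
        return (chunk.decode("latin-1"), "latin-1", "non-ASCII")


def _encoded_word(match) -> Part:
    """One encoded word: decoded text, or the word itself with defect 'ASCII'."""
    charset = match.group(1).decode("ascii").lower()
    payload = match.group(3)
    try:
        if match.group(2).lower() == b"b":
            data = base64.b64decode(payload, validate=True)
        else:
            data = quopri.decodestring(payload, header=True)
        return (data.decode(charset), charset, None)
    except (binascii.Error, LookupError, UnicodeDecodeError):
        # Unknown charset or bad payload: hand back the encoded word as-is.
        return (match.group(0).decode("ascii"), "ascii", "ASCII")


def decode_parts(raw: bytes) -> List[Part]:
    """Split a raw header value into one tuple per run or encoded word."""
    parts = []
    pos = 0
    for match in ENCODED_WORD.finditer(raw):
        if match.start() > pos:
            parts.append(_plain_run(raw[pos:match.start()]))
        parts.append(_encoded_word(match))
        pos = match.end()
    if pos < len(raw):
        parts.append(_plain_run(raw[pos:]))
    return parts


example = decode_parts(b"Re: =?iso-8859-1?q?p=F6stal?= caf\xe9 =?x-unknown?q?abc?=")
```

Note that exact byte-for-byte invertibility of successfully decoded words would need more than this, for instance keeping the original encoded word or its Q/B choice alongside the decoded text.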

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

_______________________________________________
Email-SIG mailing list
[email protected]
Your options: 
http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com
