On approximately 10/9/2009 6:25 PM, came the following characters from
the keyboard of R. David Murray:
On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote:
On approximately 10/9/2009 4:20 PM, came the following characters
from the keyboard of R. David Murray:
On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
> On approximately 10/9/2009 8:10 AM, came the following characters from
> the keyboard of Stephen J. Turnbull:
> > Glenn Linderman writes:
> > > > > produce a defect report, but then simply converted to Unicode
> > > > > as if it were Latin-1 (since there is no other knowledge
> > > > > available that could produce a better conversion).
> > > >
> > > > No, that is already corruption. Most clients will assume that
> > > > string is valid as a header, because it's valid as a string.
> > >
> > > Sure it is corruption. That's why there is a defect report. But
> > > the conversion technique is appropriate, per the Postel principle.
> >
> > Actually, I would say you are emitting leniently, in violation of the
> > Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to
> look at it, and I'm talking about ways the client can look at
> not-quite-perfect data, knowing that it is not quite perfect, but still
> being able to see it. I'm not at all talking about emitting data. You
> seem to be calling the email package helping the client to accept
> not-quite-perfect data, as a form of emitting data. It is not.
IMO, the appropriate way for the email package to provide the API you
are talking about is to provide the client with a way to get at the raw
byte string, which I think everyone agrees on. If the client wants to
decode it as if it were latin-1 to process it, it can then do that.
That certainly works, but it isn't very helpful... that forces the
client application to reproduce the logic to parse the header value
and decode the parts that can be decoded successfully, and that is
exactly the sort of thing Stephen was complaining about when he
thought I was suggesting that to be a requirement (but he was
confused about what I was suggesting).
I wasn't clear, sorry :). The current API has a 'decode_header' function,
which doesn't do the byte-to-unicode decode (yeah, there's another naming
problem here...we have two types of decoding and only one word for both)
but instead returns (bytes, charset) tuples. This piece of the API is
broken in python3, and I don't think it is the right API going forward,
but that _kind_ of API is what I meant by 'getting at the raw byte
string': the byte string that failed the bytes-to-unicode decoding,
not the entire header (though there will also be a way to get that if
you need it, I presume.)
Yeah, that'd be better.
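
For reference, here is a small sketch of the (bytes, charset) shape
mentioned above, using email.header.decode_header as it behaves in
current Python 3 releases. That is an assumption about versions later
than the ones discussed in this thread. The input is the encoded word
used in the standard library documentation.

```python
from email.header import decode_header

# One RFC 2047 encoded word. decode_header undoes the transfer encoding
# (Q or base64) but does not turn the bytes into str.
parts = decode_header("=?iso-8859-1?q?p=F6stal?=")
print(parts)  # [(b'p\xf6stal', 'iso-8859-1')]

# The second kind of "decoding", bytes to str, is left to the caller.
raw, charset = parts[0]
text = raw.decode(charset)
```

The caller still has to handle the case where the charset is unknown or
the bytes do not decode, which is the case under discussion here.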
Of course, when returning Unicode strings, there would be no particular
need to identify the various charsets in which the header was
transmitted, except for invertibility and error handling, unless the
client wanted to track that for some reason.
If the goal is to preserve invertibility, then maybe tuples like (str,
charset, defect) would be better, where defect would be None for good
data. If defect were "non-ASCII", then you'd know the str was converted
as if it were charset. [That would be Latin-1 in my book, but if the
email package had rules, or the API had parameters, for how to deal with
non-ASCII stuff, some other charset could be specified; if that fails it
might still have to fall back to Latin-1.] If defect were "ASCII", then
you'd know that the str looked like an encoded word but couldn't be
decoded, because the charset wasn't recognized or the decoding via that
charset failed, so the encoded word itself was supplied unchanged.
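
A minimal sketch of how one chunk of a header might be turned into such
a tuple. Everything here is hypothetical: describe_chunk, the regular
expression, and the exact defect strings are illustrations of the idea,
not an existing or proposed email package API.

```python
import base64
import binascii
import quopri
import re

# Simplified RFC 2047 encoded word: =?charset?encoding?payload?=
# (printable ASCII only, no "?" or whitespace inside the fields).
ENCODED_WORD = re.compile(rb"=\?([!->@-~]+)\?([bBqQ])\?([!->@-~]*)\?=")

def describe_chunk(raw: bytes):
    """Return a hypothetical (str, charset, defect) tuple for one chunk."""
    m = ENCODED_WORD.fullmatch(raw)
    if m is None:
        try:
            return raw.decode("ascii"), "ascii", None
        except UnicodeDecodeError:
            # Raw 8-bit data: Latin-1 maps every byte to one code point,
            # so the original bytes can be recovered by re-encoding.
            return raw.decode("latin-1"), "latin-1", "non-ASCII"
    charset = m.group(1).decode("ascii")
    try:
        if m.group(2).lower() == b"b":
            data = base64.b64decode(m.group(3), validate=True)
        else:
            data = quopri.decodestring(m.group(3), header=True)
        return data.decode(charset), charset.lower(), None
    except (LookupError, UnicodeDecodeError, binascii.Error):
        # Unknown charset or undecodable payload: hand back the
        # encoded word itself, which is pure ASCII.
        return raw.decode("ascii"), "ascii", "ASCII"
```

With this sketch, b"caf\xe9" comes back as ("café", "latin-1",
"non-ASCII"), and an encoded word naming an unknown charset comes back
unchanged with defect "ASCII".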
Correspondingly, a header value could be set by supplying such a list,
even with defect values as described above. That permits invertibility
and passing on what was obtained: if there are overriding local
conventions (yep, such things used to be used, and maybe still are in
some areas), the data would be preserved as well as possible, and the
email package could support creation of messages according to the local
conventions.
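
A rough sketch of the reverse direction, assuming the same hypothetical
(str, charset, defect) tuples. chunks_to_header is an invented name, and
email.header.Header is used only as a convenient way to re-encode good
data. The tuple does not record whether Q or base64 was used, so this
restores the content but not necessarily the exact original bytes of a
good encoded word.

```python
from email.header import Header

def chunks_to_header(chunks):
    """Rebuild a raw header value from (str, charset, defect) tuples."""
    out = []
    for text, charset, defect in chunks:
        if defect == "non-ASCII":
            # Decoded "as if" charset; encoding the same way gives
            # back the original 8-bit bytes.
            out.append(text.encode(charset))
        elif defect == "ASCII" or charset == "ascii":
            # Plain ASCII run, or an encoded word passed through as is.
            out.append(text.encode("ascii"))
        else:
            # Good data: emit a fresh RFC 2047 encoded word.
            out.append(Header(text, charset).encode().encode("ascii"))
    return b"".join(out)

rebuilt = chunks_to_header([
    ("Re: ", "ascii", None),
    ("p\xf6stal", "iso-8859-1", None),
    (" caf\xe9", "latin-1", "non-ASCII"),
])
```

This assumes the ASCII runs in the list carry the whitespace between
words, which a real API would have to decide on.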
I'd hope that a separate tuple would be used for each encoded-word, or,
if charset ASCII and defect None, then it would describe a run of ASCII
between encoded words. Yes, an encoded word can be encoded in ASCII for
rare use (if the input word looks like an encoded word), so that would
cause a sequence of charset ASCII, defect None tuples, but otherwise a
plain ASCII header value would have a single entry in the list of tuples.
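
To make the "one entry per encoded-word, one entry per run in between"
idea concrete, here is a small sketch of the splitting step alone.
split_header and its labels are hypothetical, and the pattern is a
simplification of RFC 2047 (it does not require whitespace around
encoded words).

```python
import re

# Simplified encoded-word pattern: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(rb"=\?[!->@-~]+\?[bBqQ]\?[!->@-~]*\?=")

def split_header(raw: bytes):
    """Split a raw header value into encoded words and the runs between."""
    pieces = []
    pos = 0
    for m in ENCODED_WORD.finditer(raw):
        if m.start() > pos:
            pieces.append(("run", raw[pos:m.start()]))
        pieces.append(("encoded-word", m.group()))
        pos = m.end()
    if pos < len(raw):
        pieces.append(("run", raw[pos:]))
    return pieces

mixed = split_header(b"Re: =?iso-8859-1?q?p=F6stal?= =?utf-8?b?w6k=?= done")
plain = split_header(b"plain ASCII value")
```

Here mixed has five pieces (three runs and two encoded words), while
plain has a single run, matching the single-entry case for a plain
ASCII header value.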
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Email-SIG mailing list