Antoine Pitrou writes:

> You could also point out UTF-16 or EBCDIC, but I fail to see how that's
> relevant. Do you have problems with ISO 2022 when parsing, say, e-mail
> headers?

Yes, of course! Especially when it's say, packed EUC not encapsulated in MIME words. I think Mailman now handles that without crashing, but it took 10 years. Most Emacs MUAs still blow chunks on that. My procmail recipes and my employer's virus checker both occasionally punt.

The point about ISO 2022 is that it allows arbitrary binary crap in the stream, delimited by appropriate well-defined constructs. Just like the ASCII-like tokens in the protocols you talk about. But parsing full-bore ISO 2022 is non-trivial, especially if you're going to try to provide error-handling that's useful to the user. Nobody ever really took it seriously as a solution to the problem of internationalization in the 15 years or so when it was the only solution, and even less so once it became clear that UCSes were going to get traction.

> > > not arbitrary "arrays of bytes". And making indexing of bytes
> > > objects return ints was IMHO a mistake.
> >
> > Bytes objects are not ASCII strings, even though they can be used to
> > represent them.
>
> I'm talking about practice,

So am I, and so is Nick.

> not some idealistic view of the world.
> In many use cases (XML, HTML, e-mail headers, many other test-based
> protocols), you can get a mixture of ASCII "commands", and opaque
> binary stuff (which will or will not, depending on these "commands",
> have a meaningful unicode decoding).

Yeah, so what? Those protocol tokens are deliberately chosen to resemble ASCII text, but you need to parse them out of the binary sludge somehow, and the surrounding content remains binary sludge until deserialized or (for text) decoded. How is having b[0] return a bytes object, rather than an integer, going to help in that? Especially if the value is not in the ASCII range?

> > AFAICS, anything that should be done with ASCII-punned magic numbers
> > ("protocol tokens", if you prefer) can be done with slices and (ta-da!)
> > case conversion.
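
For reference, the indexing and slicing behaviour under discussion looks roughly like this in Python 3. This is only a minimal sketch: the sample header bytes are a made-up value, not taken from any real message.

```python
# Hypothetical sample data: an ASCII-like protocol token followed by
# a single octet outside the ASCII range.
data = b"Subject: \xe9"

first = data[0]      # indexing a bytes object yields an int
token = data[0:7]    # slicing yields a bytes object
last = data[-1]      # an int again, here outside the ASCII range

# Case-insensitive match on the protocol token, done with nothing more
# than a slice and case conversion; the non-ASCII octet is never touched.
is_subject = data[:8].lower() == b"subject:"
```

Note that `data[0:1]` gives a one-octet bytes object when that is what the caller wants, so the integer returned by `data[0]` does not get in the way of token matching.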
>
> So, basically, you're saying that we should remove useful functionality

No, that *was* Nick's position; I specifically opposed the suggestion that "lower" and "upper" be removed, and he concurred after a bit of thought. And remember, he's talking about removing "swapcase". Which RFC defines a protocol where that would be useful? How about "title"?

> and tell people to reimplement an adhoc version of it when they
> need it.

Of course not; I'm with Michael Foord on that: nobody should ever be asked to reimplement swapcase!

My position is simply that bytes are not text, and the occasional reminder (such as b[0] returning an integer, not a bytes object) is good.

My experience has been that it makes a lot of sense to layer these things, for example transforming a protocol stream serialized as octets into a more structured object composed of protocol tokens and payloads. It's *not* text, and the relevant techniques are different.

It's like the old saw about "aha, I'll use regexps to solve this problem!" and now you have *two* problems. I don't advocate getting rid of regexps, and I don't advocate removing methods from bytes (although I do dream about it occasionally). I do advocate that people think twice before implementing complex text-like algorithms on binary protocol streams. If the stream really is text-like, then transform it into text of a known, well-behaved encoding, and then apply the powerful text-processing facilities provided for str. If it's not, then transform to a token stream or whatever makes sense. In both cases, do as little "text processing" on bytes objects as possible, and put more structure on the content as soon as possible.

If you really need the efficiency, then do what you need to do. As I say, I don't have any practical objection to keeping your tools for that case. But such applications, although important (I guess), are a minority.

> That sounds obnoxious.

Good advice almost always sounds obnoxious to the recipient.

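The layering described above (octets first, then protocol tokens and payloads, then decoding only what is known to be text) might be sketched as follows. The names `Field`, `parse_field` and `payload_text` are hypothetical, chosen for illustration, and the "Name: value" layout is an assumed toy protocol rather than any particular RFC.

```python
from typing import NamedTuple, Optional


class Field(NamedTuple):
    """Hypothetical structured form of one 'Name: value' line."""
    name: str       # protocol token, decoded from ASCII and lower-cased
    payload: bytes  # left as opaque octets until a charset is known


def parse_field(line: bytes) -> Field:
    """Split b'Name: value' into a token and an undecoded payload."""
    raw_name, sep, raw_value = line.partition(b":")
    if not sep:
        raise ValueError("no ':' separator in field")
    # Only the token is treated as ASCII; a non-ASCII name raises
    # UnicodeDecodeError instead of being silently guessed at.
    name = raw_name.strip().decode("ascii").lower()
    return Field(name, raw_value.strip())


def payload_text(field: Field, charset: Optional[str]) -> Optional[str]:
    """Decode the payload only when the caller knows its encoding."""
    if charset is None:
        return None  # still binary sludge; leave it alone
    return field.payload.decode(charset)


field = parse_field(b"Subject: caf\xc3\xa9")
text = payload_text(field, "utf-8")
```

Once the payload is a str, the usual text-processing methods apply to it; the bytes object itself only ever sees `partition`, `strip` and `decode`.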
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com