At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> > I didn't mean that it was the only purpose. In Python 2.x, practical code
> > has to sometimes deal with "string-like" objects. That is, code that takes
> > either strings or unicode. If such code calls bytes(), it's going to want
> > to include an encoding so that unicode conversions won't fail.
>
>That sounds like a rather hypothetical example. Have you thought it
>through? Presumably code that accepts both str and unicode either
>doesn't care about encodings, but simply returns objects of the same
>type as the arguments -- and then it's unlikely to want to convert the
>arguments to bytes; or it *does* care about encodings, and then it
>probably already has to special-case str vs. unicode because it has to
>control how str objects are interpreted.

Actually, it's the other way around. Code that wants to output
uninterpreted bytes right now and accepts either strings or Unicode has to
special-case *unicode* -- not str, because str is the only "bytes type" we
currently have.

This creates an interesting issue in WSGI for Jython, which of course only
has one (unicode-based) string type now. Since there's no bytes type in
Python in general, the only solution we could come up with was to treat
such strings as latin-1:

    http://www.python.org/peps/pep-0333.html#unicode-issues

This is why I'm biased towards latin-1 encoding of unicode to bytes; it's
"the same thing" as an uninterpreted string of bytes.

I think the difference in our viewpoints is that you're still thinking
"string" thoughts, whereas I'm thinking "byte" thoughts. Bytes are just
bytes; they don't *have* an encoding.

So, if you think of "converting a string to bytes" as meaning "create an
array of numerals corresponding to the characters in the string", then this
leads to a uniform result whether the characters are in a str or a unicode
object. In other words, to me, bytes(str_or_unicode) should be treated as:

    bytes(map(ord, str_or_unicode))

In other words, without an encoding, bytes() should simply treat str and
unicode objects *as if they were a sequence of integers*, and produce an
error when an integer is out of range. This is a logical and consistent
interpretation in the absence of an encoding, because in that case you
don't care about the encoding - it's just raw data.

If, however, you include an encoding, then you're stating that you want to
encode the *meaning* of the string, not merely its integer values.

>What would bytes("abc\xf0", "latin-1") *mean*? Take the string
>"abc\xf0", interpret it as being encoded in XXX, and then encode from
>XXX to Latin-1. But what's XXX? As I showed in a previous post,
>"abc\xf0".encode("latin-1") *fails* because the source for the
>encoding is assumed to be ASCII.

I'm saying that XXX would be the same encoding as you specified. i.e.,
including an encoding means you are encoding the *meaning* of the string.

However, I believe I mainly proposed this as an alternative to having
bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I
think is probably a saner default.

>Your argument for symmetry would be a lot stronger if we used Latin-1
>for the conversion between str and Unicode. But we don't.

But that's because we're dealing with its meaning *as a string*, not merely
as ordinals in a sequence of bytes.

> I like the
>other interpretation (which I thought was yours too?) much better: str
><--> bytes conversions don't use encodings by simply change the type
>without changing the bytes;

I like it better too. The part you didn't like was where MAL and I believe
this should be extended to Unicode characters in the 0-255 range also. :)

>There's one property that bytes, str and unicode all share: type(x[0])
>== type(x), at least as long as len(x) >= 1. This is perhaps the
>ultimate test for string-ness.
>
>Or should b[0] be an int, if b is a bytes object? That would change
>things dramatically.

+1 for it being an int. Heck, I'd want to at least consider the
possibility of introducing a character type (chr?) in Python 3.0, and
getting rid of the "iterating a string yields strings" characteristic. I've
found it to be a bit of a pain when dealing with heterogeneous nested
sequences that contain strings.

>There's also the consideration for APIs that, informally, accept
>either a string or a sequence of objects. Many of these exist, and
>they are probably all being converted to support unicode as well as
>str (if it makes sense at all). Should a bytes object be considered as
>a sequence of things, or as a single thing, from the POV of these
>types of APIs? Should we try to standardize how code tests for the
>difference?
>(Currently all sorts of shortcuts are being taken, from
>isinstance(x, (list, tuple)) to isinstance(x, basestring).)

I'm inclined to think of certain features at least in terms of the buffer
interface, but that's not something that's really exposed at the Python
level.
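
For illustration only, here is a minimal sketch of the
bytes(map(ord, str_or_unicode)) interpretation discussed in this message. It
is not code from the thread. It assumes a Python 3 interpreter, where str is
the unicode type and a built-in bytes type exists, and the helper name
to_raw_bytes is hypothetical.

```python
def to_raw_bytes(text):
    """Treat a string as a plain sequence of integers, one byte each.

    Hypothetical helper (not from the thread): no encoding is involved,
    so any character whose ordinal is above 255 is rejected.
    """
    return bytes(map(ord, text))


raw = to_raw_bytes("abc\xf0")

# For characters in the 0-255 range, the result matches latin-1 encoding.
print(raw == "abc\xf0".encode("latin-1"))

# In Python 3, indexing a bytes object yields an int, not a length-1 bytes.
print(type(raw[0]) is int)

# An ordinal outside 0-255 is an error rather than a silent re-encoding.
try:
    to_raw_bytes("abc\u20ac")
except ValueError as exc:
    print("out of range:", exc)
```

With an explicit encoding the call would instead encode the meaning of the
text, which is the distinction drawn in this message between "byte" thoughts
and "string" thoughts.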