>>>>> "Bob" == Bob Ippolito <[EMAIL PROTECTED]> writes:
Bob> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:

 >> But you aren't always getting *unicode* text from the decoding
 >> of bytes, and you may be encoding bytes *to* bytes:

Please note that I presumed that you can indeed assume that decoding
of bytes always results in unicode, and that encoding of unicode
always results in bytes.  I believe Guido made the proposal relying
on that assumption too.

The constructor notation makes no sense for making an object of the
same type as the original unless it's a copy constructor.  You could
argue that the base64 language is indeed a different language from
the bytes language, and I'd agree.  But since there's no way in
Python to determine whether a string that conforms to base64 is
supposed to be base64 or bytes, it would be a very bad idea to
interpret the distinction as one of type.

 >> b2 = bytes(b, "base64")
 >> b3 = bytes(b2, "base64")
 >> Which direction are we going again?

Bob> This is *exactly* why the current set of codecs are INSANE.
Bob> unicode.encode and str.decode should be used *only* for
Bob> unicode codecs.  Byte transforms are entirely different
Bob> semantically and should be some other method pair.

General filters are semantically different, I agree.  But "encode"
and "decode" in English are certainly far more general than character
coding conversion.  The use of those methods for any stream
conversion that is invertible (e.g., compression or encryption) is
not insane.  It's just pedagogically inconvenient given the existing
confusion (outside of python-dev, of course<wink>) about character
coding issues.

I'd like to rephrase your statement as "*only* unicode.encode and
str.decode should be used for unicode codecs".  I.e.,
str.encode(codec) and unicode.decode(codec) should raise errors if
codec is a "unicode codec".

The question in my mind is whether we should allow other kinds of
codecs or not.  I could live with "not"<wink>, but if we're going to
have other kinds of codecs, I think they should have concrete
signatures.  I.e., basestring -> basestring shouldn't be allowed.
Content transfer encodings like BASE64 and quoted-printable,
compression, encryption, etc. IMO should be bytes -> bytes.
Overloading to unicode -> unicode is sorta plausible for BASE64 or
QP, but YAGNI.

OTOH, the Unicode standard does define a number of unicode ->
unicode transformations, and it might make sense to generalize to
case conversions etc.  (Note that these conversions are
pseudo-invertible, so you can think of them as generalized
.encode/.decode pairs.  The inverse is usually the identity, which
seems weird, but from the pedagogical standpoint you could handle
that weirdness by raising an error if the .encode method were
invoked.)

To be concrete, I could imagine writing

    s2 = s1.decode('upcase')
    if s2 == s1:
        print "Why are you shouting at me?"
    else:
        print "I like calm, well-spoken snakes."
    s3 = s2.encode('upcase')
    if s3 == s2:
        print "Never fails!"
    else:
        print "See a vet; your Python is *very* sick."

I chose the decode method to do the non-trivial transformation
because .decode()'s value is supposed to be "original" text in MAL's
terms.  And that's true of uppercase-only text; you're still supposed
to be able to read it, so I guess it's not "encoded".  That's pretty
pedantic; I think it's better to raise on .encode('upcase').

-- 
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba
Tennodai 1-1-1, Tsukuba 305-8573 JAPAN

Ask not how you can "do" free software business;
ask what your business can "do for" free software.
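A minimal sketch of the "which direction" ambiguity, assuming the
str-to-str 'base64' codec that ships with Python 2.x: both calls
return a plain str, so nothing but the codec name tells you which way
the data just went.

    # Python 2.x: the 'base64' codec is str -> str in both directions.
    b = 'abc'
    b2 = b.encode('base64')    # 'YWJj\n' -- base64 text, still a str
    b3 = b2.decode('base64')   # 'abc' again -- raw bytes, also a str
    assert type(b2) is type(b3) is str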
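And a minimal sketch of how an 'upcase' codec could be hooked into
the Python 2.x codec registry so that the s1.decode('upcase') snippet
above actually runs.  The codec name and the search function are
assumptions for illustration only; no such codec exists in the
stdlib.  Here decode() does the non-trivial upcasing and encode() is
the identity; making encode() raise instead, per the last paragraph,
would be a one-line change.

    import codecs

    def _search_upcase(name):
        # Hypothetical 'upcase' codec (illustration only, not stdlib).
        if name != 'upcase':
            return None

        def upcase_encode(input, errors='strict'):
            return input, len(input)            # inverse is the identity

        def upcase_decode(input, errors='strict'):
            return input.upper(), len(input)    # "original", readable text

        # Plain 4-tuple (encoder, decoder, streamreader, streamwriter)
        # is accepted by the Python 2 codec registry.
        return (upcase_encode, upcase_decode, None, None)

    codecs.register(_search_upcase)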