On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:

> John Nagle <na...@animats.com> writes:
>
>> The library bug, if any, is that you can't apply
>>
>>     unicode(s, errors='replace')
>>
>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>> raised. However
>>
>>     unicode(s)
>>
>> will accept Unicode input.
>
> I think that's a Python bug. If the latter succeeds as a no-op, the
> former should also succeed as a no-op. Neither should ever get any
> errors when ‘s’ is a ‘unicode’ object already.

No. The semantics of the unicode function (technically: a type
constructor) are well-defined, and there are two distinct behaviours:

    unicode(obj)

is analogous to str(obj), and it attempts to convert obj to a unicode
string by calling obj.__unicode__, if it exists, or __str__ if it
doesn't. No encoding or decoding is attempted in the event that obj is
a unicode instance.

    unicode(obj, encoding, errors)

is explicitly stated in the docs as decoding obj if EITHER of encoding
or errors is given, AND that obj must be either an 8-bit string (bytes)
or a buffer object.

It is true that u''.decode() will succeed, in Python 2, but the fact
that unicode objects have a decode method at all is IMO a bug. It has
also been corrected in Python 3, where (unicode) str objects no longer
have a decode method, and bytes objects no longer have an encode method.

>> The Python documentation
>> ("http://docs.python.org/library/functions.html#unicode") does not
>> mention this.

Yes it does. It is the SECOND sentence, immediately after the summary
line:

    unicode([object[, encoding[, errors]]])
        Return the Unicode string version of object using one of the
        following modes:

        If encoding and/or errors are given, unicode() will decode the
        object which can either be an 8-bit string or a character
        buffer using the codec for encoding. ...

Admittedly, it doesn't *explicitly* state that TypeError will be raised,
but what other kind of exception would you expect when you supply an
argument of the wrong type?

>> It is therefore necessary to check the type before
>> calling "unicode", or catch the undocumented TypeError exception
>> afterward.
>
> Yes, this check should not be necessary; calling the ‘unicode’
> constructor with an object that's already an instance of ‘unicode’
> should just return the object as-is, IMO. It shouldn't matter that
> you've specified how decoding errors are to be handled, because in that
> case no decoding happens anyway.

I don't believe that it is the job of unicode() to Do What I Mean, but
only to Do What I Say. If I *explicitly* tell unicode() to decode the
argument (by specifying either the codec or the error handler or both)
then it should not second-guess me and ignore the extra parameters.

End-user applications may, with care, try to be smart and DWIM, but
library functions should be dumb and should do what they are told.

--
Steven
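
For anyone who wants to poke at the two modes, here is a minimal sketch. It assumes Python 3, where str plays the role of Python 2's unicode and bytes plays the role of the 8-bit string; the Python 2 calls discussed above will not run there. The names describe() and to_text() are hypothetical helpers invented for this illustration, not anything from the standard library.

```python
# Python 3 analogue of the thread: str is the text type, bytes the 8-bit type.

def describe(call):
    """Run a zero-argument callable; return its result or a TypeError summary."""
    try:
        return call()
    except TypeError as exc:
        return "TypeError: %s" % exc

text = "caf\xe9"
raw = text.encode("utf-8")

# Mode 1: a single argument. Plain conversion; no decoding is attempted.
same = str(text)

# Mode 2: encoding and/or errors given. The argument must be bytes-like.
decoded = str(raw, "utf-8", "replace")
decoded_errors_only = str(raw, errors="replace")  # encoding defaults to UTF-8

# Asking for a decode of something that is already text is a type error,
# not a silent no-op.
rejected = describe(lambda: str(text, errors="replace"))

def to_text(obj, encoding="utf-8", errors="replace"):
    """Hypothetical caller-side helper: the type check lives in the
    application, not in the constructor."""
    if isinstance(obj, str):
        return obj          # already text: returned unchanged
    return str(obj, encoding, errors)  # still raises TypeError for non-bytes
```

With something like to_text() the DWIM decision is made once, visibly, by the code that wants it, and the constructor itself stays strict about what it is told to do.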