Marc-Andre Lemburg <m...@egenix.com> added the comment:

> Keep in mind that we should be able to access and use lone surrogates too,
> therefore:
> s = '\ud800' # should be valid
> len(s) # should this raise an error? (or return 0.5 ;)?
> s[0] # error here too?
> list(s) # here too?
>
> p = s + '\udc00'
> len(p) # 1?
> s[0] # '\U00010000' ?
> s[1] # IndexError?
> list(p + 'a') # ['\ud800\udc00', 'a']?
>
> We can still decide that strings with lone surrogates work only with a
> limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
>
> Another thing I noticed is that (at least on wide builds) surrogate pairs are
> not joined "on the fly":
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1
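The quoted session can be reproduced roughly as follows. This is only a sketch, and it assumes a current CPython 3 (3.4 or later) rather than the builds discussed in this thread: on such versions the UTF-16 codecs reject surrogates by default, so the `surrogatepass` error handler is used here to get the same round trip. The variable names are illustrative.

```python
# Sketch only: assumes CPython 3.4+, where the UTF-16 codecs need the
# 'surrogatepass' error handler to accept surrogate code points.

lone = '\ud800'         # a lone high surrogate is a valid string object
pair = lone + '\udc00'  # concatenation does not join the two surrogates

print(len(lone))        # 1
print(len(pair))        # 2: still two separate code points

# Encoding writes the two surrogates as two UTF-16 code units ...
raw = pair.encode('utf-16-le', 'surrogatepass')

# ... and decoding sees a well-formed surrogate pair and joins it.
joined = raw.decode('utf-16-le')

print(len(joined))        # 1
print(hex(ord(joined)))   # 0x10000
```

The explicit `utf-16-le` codec is used instead of `utf-16` only to keep a byte order mark out of `raw`; the joining happens in the decoder either way.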
Hi Tom,

welcome to Python land :-) Here's some more background information on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in Unicode objects and ones encoded in transfer formats by codecs.

We generally do allow lone surrogates, unassigned code points, lone combining code points, etc. in Unicode objects, since Python needs to be able to work on all Unicode code points and build strings with them.

The transfer format codecs do try to combine surrogates when decoding data on UCS4 builds. On UCS2 builds they create surrogate pairs as necessary. On output, those pairs are joined again to get round-trip safety.

It helps if you think of Python's Unicode objects as using UCS2 and UCS4 instead of UTF-16/32. Python does try to make working with UCS2 easy and in many cases behaves as if it were using UTF-16 internally, but there are, of course, limits to this. In practice, you only rarely get to see any of these special cases, since non-BMP code points are usually not found in everyday use. If they do become a problem for you, you have the option of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started Unicode support in 1999/2000 with Unicode 2.0/3.0, so it uses the terminology of those versions, some of which has changed in more recent versions of Unicode.

For more background information, you might want to take a look at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened: you'll also find that collation and compression were already on the plate back then, but since no one stepped forward, they weren't implemented.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com
----------
nosy: +lemburg
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________