[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Marc-Andre Lemburg Mon, 15 Aug 2011 02:05:08 -0700

Marc-Andre Lemburg <[email protected]> added the comment:

> Keep in mind that we should be able to access and use lone surrogates too, 
> therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?
> s[0]  # error here too?
> list(s)  # here too?
> 
> p = s + '\udc00'
> len(p)  # 1?
> p[0]  # '\U00010000' ?
> p[1]  # IndexError?
> list(p + 'a')  # ['\ud800\udc00', 'a']?
> 
> We can still decide that strings with lone surrogates work only with a 
> limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
> 
> Another thing I noticed is that (at least on wide builds) surrogate pairs are 
> not joined "on the fly":
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1


Hi Tom,

welcome to Python land :-) Here's some more background information
on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in
Unicode objects and ones encoded in transfer formats by codecs.

We generally do allow lone surrogates, unassigned code
points, lone combining code points, etc. in Unicode objects
since Python needs to be able to work on all Unicode code points
and build strings with them.
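
As a small illustration of that distinction (a sketch, assuming a
current Python 3; the helper name can_encode_utf8 is made up for
this example): a lone surrogate is fine inside a string object, but
a strict transfer-format codec such as UTF-8 refuses to encode it.

```python
# A lone high surrogate is a perfectly legal element of a str object.
lone = '\ud800'
length = len(lone)        # one code point stored in the object
first = lone[0]           # indexing works like for any other code point

def can_encode_utf8(text):
    """Hypothetical helper: True if the strict UTF-8 codec accepts text."""
    try:
        text.encode('utf-8')
        return True
    except UnicodeEncodeError:
        # The codec, not the string object, rejects the lone surrogate.
        return False

print(length, can_encode_utf8(lone), can_encode_utf8('abc'))
```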

The transfer format codecs do try to combine surrogates
when decoding data on UCS4 builds. On UCS2 builds they create
surrogate pairs as necessary. On output, those pairs are joined
again to ensure round-trip safety.
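
A rough sketch of that codec behaviour, assuming Python 3.4 or
later (where the UTF-16 codecs accept the 'surrogatepass' error
handler) and a build that stores non-BMP code points as single
items. The variable names are only for illustration:

```python
# Two separate surrogate code points, as stored in a str object.
pair = '\ud800\udc00'

# 'surrogatepass' lets the codec write each surrogate as one
# 16-bit code unit instead of raising UnicodeEncodeError.
raw = pair.encode('utf-16-le', 'surrogatepass')

# On decoding, the codec sees a valid surrogate pair and joins it
# into the single code point U+10000.
joined = raw.decode('utf-16-le')

# Encoding the joined code point yields the same bytes again,
# which is the round-trip property.
roundtrip = joined.encode('utf-16-le')

print(len(pair), len(joined), raw == roundtrip)
```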

It helps if you think of Python's Unicode objects using UCS2
and UCS4 instead of UTF-16/32. Python does try to make working
with UCS2 easy and in many cases behaves as if it were using
UTF-16 internally, but there are, of course, limits to this. In
practice, you only rarely get to see any of these special cases,
since non-BMP code points are usually not found in everyday
use. If they do become a problem for you, you have the option
of switching to a UCS4 build of Python.
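
To see which kind of build you are running, sys.maxunicode is the
usual indicator: it is 0xFFFF on a UCS2 build and 0x10FFFF on a
UCS4 build. The sketch below uses two hypothetical helper names
(build_kind, utf16_code_units). Note that from Python 3.3 on, every
build reports 0x10FFFF.

```python
import sys

def build_kind():
    """Hypothetical helper: classify the interpreter by sys.maxunicode."""
    if sys.maxunicode > 0xFFFF:
        return 'UCS4 (wide)'
    return 'UCS2 (narrow)'

def utf16_code_units(text):
    """Hypothetical helper: number of 16-bit units text needs in UTF-16.

    This is what len() would report on a UCS2 build for well-formed text.
    """
    return len(text.encode('utf-16-le')) // 2

non_bmp = '\U00010000'
print(build_kind(), len(non_bmp), utf16_code_units(non_bmp))
```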

You also have to be aware that Python's Unicode support dates
from 1999/2000, when Unicode 2.0/3.0 was current, so it uses the
terminology of those versions, some of which has changed in
more recent versions of Unicode.

For more background information, you might want to take a look
at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened: you'll also find that
collation and compression were already on the plate back then,
but since no one stepped forward, they weren't implemented.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

----------
nosy: +lemburg
title: Python lib re cannot handle Unicode properly due to narrow/wide bug ->
Python lib re cannot handle Unicode properly due to narrow/wide bug

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________