[Cython] Switching from Py_UNICODE to Py_UCS4

Stefan Behnel Fri, 28 Jan 2011 23:38:06 -0800

Hi,

there is a recent discussion on python-dev about a new memory layout for 
the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's 
serious ;)


http://comments.gmane.org/gmane.comp.python.devel/120784

If nothing else, it gave me a new view on Py_UCS4 (basically a 32-bit
unsigned int), which I had completely lost sight of. It's public but
undocumented and has been there basically forever, and it's a much nicer
type to support than Py_UNICODE, which changes size based on build-time
options. Py_UCS4 is capable of representing any Unicode code point on any
platform.

So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4
internally (without breaking user code, which can continue to use either of
the two explicitly). This means that loops over unicode objects will infer
Py_UCS4 as the loop variable type, as would indexing. It would basically
become the native C type that 1-character unicode strings coerce to and
from. Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the
value is too large for the given CPython runtime, as would write access to
unicode objects (in case anyone really does that) with values outside of
the platform-specific Py_UNICODE value range. Writing to unicode buffers
will be dangerous and tricky anyway if the above PEP gets accepted.
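
To make the coercion rule concrete, here is a rough pure-Python model of
the range check. It is only a sketch: the function name
narrow_to_py_unicode, the py_unicode_bits parameter and the choice of
OverflowError are assumptions made for illustration, not existing Cython or
CPython API.

```python
# Hypothetical model of the proposed Py_UCS4 -> Py_UNICODE coercion check.
# All names here are illustrative assumptions, not real Cython/CPython API.

MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point


def narrow_to_py_unicode(code_point, py_unicode_bits):
    """Return code_point if it fits into a Py_UNICODE of the given width.

    py_unicode_bits models the build-time size of Py_UNICODE (16 or 32).
    """
    if py_unicode_bits not in (16, 32):
        raise ValueError("Py_UNICODE is modelled as 16 or 32 bits wide")
    if not 0 <= code_point <= MAX_UNICODE:
        raise ValueError("not a Unicode code point: %r" % (code_point,))

    # A 16-bit Py_UNICODE can only hold values up to 0xFFFF directly.
    limit = 0xFFFF if py_unicode_bits == 16 else MAX_UNICODE
    if code_point > limit:
        # The exception type is an assumption; the proposal only says
        # "would raise an exception".
        raise OverflowError(
            "code point U+%04X does not fit into a %d-bit Py_UNICODE"
            % (code_point, py_unicode_bits))
    return code_point
```

With this model, narrow_to_py_unicode(0x20AC, 16) passes the value through,
while narrow_to_py_unicode(0x1D11E, 16) raises, because that code point
needs more than 16 bits.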

One open question that I see is whether we should handle surrogate pairs
automatically. They are basically a split of large Unicode code point
values (>65535) into two 16-bit values in reserved ranges (0xD800-0xDBFF
and 0xDC00-0xDFFF) that are safe to detect. So we could allow a
2-'character' surrogate pair in a unicode string to coerce to one Py_UCS4
character and coerce that back into a surrogate pair string if the runtime
uses 16 bits for Py_UNICODE. Note that this would only work for single
characters, not for looping or indexing (without the PEP, that is). So it's
somewhat inconsistent. It would work well for literals, though. Also, we'd
have to support it for 'in' tests, as a Py_UCS4 value may simply not be in
a Py_UNICODE buffer, even though the character is in the string.
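
For reference, the surrogate arithmetic and the 'in' test problem can be
sketched in plain Python as follows. The split/join formulas are the
standard UTF-16 ones; the function names and the list-of-integers model of
a 16-bit Py_UNICODE buffer are assumptions made for this example and do not
reflect how Cython would actually implement it.

```python
# Sketch of surrogate pair handling on a 16-bit Py_UNICODE build.
# Function names are hypothetical; a buffer is modelled as a list of ints.


def split_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


def join_surrogates(high, low):
    """Combine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def contains_code_point(units, code_point):
    """'in' test for a code point against a list of 16-bit code units."""
    if code_point <= 0xFFFF:
        return code_point in units
    # A large code point never appears as a single unit, so look for its
    # two surrogates sitting next to each other instead.
    high, low = split_surrogates(code_point)
    return any(a == high and b == low for a, b in zip(units, units[1:]))
```

For U+1D11E, split_surrogates() gives (0xD834, 0xDD1E). A plain
`0x1D11E in [0x61, 0xD834, 0xDD1E]` is False, whereas
contains_code_point() on the same list finds the pair, which is the
inconsistency the 'in' test would have to paper over.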

Comments?

Stefan