Hi,

there is a recent discussion on python-dev about a new memory layout for the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's serious ;)

http://comments.gmane.org/gmane.comp.python.devel/120784

If nothing else, it gave me a new view on Py_UCS4 (basically a 32-bit unsigned int), which I had completely lost sight of. It's public but undocumented and has been there basically forever, and it's a much nicer type to support than Py_UNICODE, which changes size based on build-time options. Py_UCS4 is capable of representing any Unicode code point on any platform.

So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4 internally (without breaking user code, which can continue to use either of the two explicitly). This means that loops over unicode objects would infer Py_UCS4 as the type of the loop variable, as would indexing. It would basically become the native C type that 1-character unicode strings coerce to and from.

Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the value is too large for the given CPython runtime, as would write access to unicode objects (in case anyone really does that) with values outside of the platform-specific Py_UNICODE value range. Writing to unicode buffers will be dangerous and tricky anyway if the above PEP gets accepted.

One open question that I see is whether we should handle surrogate pairs automatically. They are basically a split of large Unicode code point values (>65535) into two code points in specific ranges that are safe to detect. So we could allow a 2-'character' surrogate pair in a unicode string to coerce to one Py_UCS4 character, and coerce that back into a surrogate pair string if the runtime uses 16 bits for Py_UNICODE.

Note that this would only work for single characters, not for looping or indexing (without the PEP, that is), so it's somewhat inconsistent. It would work well for literals, though. Also, we'd have to support it for 'in' tests, as a Py_UCS4 value may simply not be in a Py_UNICODE buffer as a single item, even though the character is in the string.

Comments?
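To make the surrogate pair question and the range check above a bit more concrete, here is a minimal sketch in plain Python. It is not Cython's implementation, only an illustration of the arithmetic involved. All function names are hypothetical, OverflowError is an assumed choice (the proposal only says "an exception"), and 0x10FFFF is assumed as the upper bound for builds with a 32-bit Py_UNICODE.

```python
# Illustrative sketch only: hypothetical helper names, not Cython or CPython API.

def join_surrogates(high, low):
    """Combine a surrogate pair into one code point (what Py_UCS4 would hold)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def ucs4_to_py_unicode(cp, py_unicode_bits=16):
    """Range-checked narrowing from a Py_UCS4 value to a Py_UNICODE value.

    The exception type and the 32-bit upper bound are assumptions.
    """
    max_value = 0xFFFF if py_unicode_bits == 16 else 0x10FFFF
    if not 0 <= cp <= max_value:
        raise OverflowError("value too large for Py_UNICODE in this runtime")
    return cp


def contains(units, cp):
    """'in' test for a code point against a buffer of 16-bit values."""
    if cp <= 0xFFFF:
        return cp in units
    # Large code points can only appear as two adjacent surrogates.
    high, low = split_code_point(cp)
    return any(units[i] == high and units[i + 1] == low
               for i in range(len(units) - 1))
```

For example, U+1D11E splits into the pair (0xD834, 0xDD1E) on a 16-bit runtime, so a plain item comparison against the buffer would miss it, while the pair-aware `contains()` finds it.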
Stefan