On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel <stefan...@behnel.de> wrote:
> Hi,
>
> there is a recent discussion on python-dev about a new memory layout for
> the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's
> serious ;)
>
> http://comments.gmane.org/gmane.comp.python.devel/120784

That's an interesting PEP, I like it.

> If nothing else, it gave me a new view on Py_UCS4 (basically a 32bit
> unsigned int), which I had completely lost from sight. It's public and
> undocumented and has been there basically forever, but it's a much nicer
> type to support than Py_UNICODE, which changes size based on build time
> options. Py_UCS4 is capable of representing any Unicode code point on any
> platform.
>
> So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4
> internally (without breaking user code which can continue to use either of
> the two explicitly). This means that loops over unicode objects will infer
> Py_UCS4 as loop variable, as would indexing. It would basically become the
> native C type that 1 character unicode strings would coerce to and from.
> Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the value
> is too large in the given CPython runtime, as would write access to unicode
> objects (in case anyone really does that) outside of the platform specific
> Py_UNICODE value range. Writing to unicode buffers will be dangerous and
> tricky anyway if the above PEP gets accepted.

I am a bit concerned about the performance overhead of the Py_UCS4 to
Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
maybe that's both uncommon and negligible.

> One open question that I see is whether we should handle surrogate pairs
> automatically. They are basically a split of large Unicode code point
> values (>65535) into two code points in specific ranges that are safe to
> detect. So we could allow a 2 'character' surrogate pair in a unicode
> string to coerce to one Py_UCS4 character and coerce that back into a
> surrogate pair string if the runtime uses 16 bit for Py_UNICODE. Note that
> this would only work for single characters, not for looping or indexing
> (without the PEP, that is). So it's somewhat inconsistent. It would work
> well for literals, though.
> Also, we'd have to support it for 'in' tests, as
> a Py_UCS4 value may simply not be in a Py_UNICODE buffer, even though the
> character is in the string.
>
> Comments?

No, I don't think we should handle surrogate pairs automatically, at least
not without making it optional: this could have a significant performance
impact with little benefit for most users. Using these higher characters is
rare, but using them on a non-UCS4 build is probably even rarer. Also, this
would be inconsistent with Python-level slicing, indexing, and range, right?

- Robert
_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev
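
For reference, the surrogate-pair split and the 'in' test issue discussed
above can be sketched in plain Python. This is only an illustration of the
UTF-16 arithmetic, not what Cython does or would do; the function names are
hypothetical, and a 16-bit Py_UNICODE buffer is modelled here as a list of
integers.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def contains_code_point(units, code_point):
    """'in' test against a list of 16-bit code units (hypothetical helper)."""
    if code_point <= 0xFFFF:
        # Fits in one 16-bit unit, so a plain membership test is enough.
        return code_point in units
    # Larger values never appear as a single unit; look for the
    # adjacent (high, low) pair instead.
    high, low = to_surrogate_pair(code_point)
    return any(a == high and b == low for a, b in zip(units, units[1:]))
```

The last helper shows why the 'in' test needs special handling on a 16-bit
build: a value above 0xFFFF has to be searched for as two adjacent units
rather than as one element, which is the kind of extra work the performance
concern above is about.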