Re: [Cython] Switching from Py_UNICODE to Py_UCS4

Robert Bradshaw Sat, 29 Jan 2011 01:02:14 -0800

On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel <stefan...@behnel.de> wrote:
> Hi,
>
> there is a recent discussion on python-dev about a new memory layout for
> the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's
> serious ;)
>
> http://comments.gmane.org/gmane.comp.python.devel/120784


That's an interesting PEP, I like it.

> If nothing else, it gave me a new view on Py_UCS4 (basically a 32bit
> unsigned int), which I had completely lost from sight. It's public and
> undocumented and has been there basically forever, but it's a much nicer
> type to support than Py_UNICODE, which changes size based on build time
> options. Py_UCS4 is capable of representing any Unicode code point on any
> platform.
>
> So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4
> internally (without breaking user code which can continue to use either of
> the two explicitly). This means that loops over unicode objects will infer
> Py_UCS4 as loop variable, as would indexing. It would basically become the
> native C type that 1 character unicode strings would coerce to and from.
> Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the value
> is too large in the given CPython runtime, as would write access to unicode
> objects (in case anyone really does that) outside of the platform specific
> Py_UNICODE value range. Writing to unicode buffers will be dangerous and
> tricky anyway if the above PEP gets accepted.

I am a bit concerned about the performance overhead of the Py_UCS4 to
Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
maybe that's both uncommon and negligible.
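
To make the coercion in question concrete, here is a rough sketch in
plain Python (not what Cython actually generates). The narrowing check
amounts to one range comparison per value. The function name and the
`py_unicode_bits` parameter are made up for illustration.

```python
def coerce_ucs4_to_py_unicode(code_point, py_unicode_bits=16):
    """Hypothetical helper mimicking a Py_UCS4 -> Py_UNICODE narrowing check."""
    if py_unicode_bits not in (16, 32):
        raise ValueError("py_unicode_bits must be 16 or 32")
    # A 16-bit Py_UNICODE holds at most 0xFFFF; a 32-bit one can hold
    # every Unicode code point, i.e. up to 0x10FFFF.
    max_value = 0xFFFF if py_unicode_bits == 16 else 0x10FFFF
    if not 0 <= code_point <= max_value:
        raise OverflowError(
            "value U+%04X does not fit into a %d-bit Py_UNICODE"
            % (code_point, py_unicode_bits)
        )
    return code_point
```

On a build with 32-bit Py_UNICODE the check can never fail for a valid
code point, so the cost only matters on 16-bit builds.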

> One open question that I see is whether we should handle surrogate pairs
> automatically. They are basically a split of large Unicode code point
> values (>65535) into two code points in specific ranges that are safe to
> detect. So we could allow a 2 'character' surrogate pair in a unicode
> string to coerce to one Py_UCS4 character and coerce that back into a
> surrogate pair string if the runtime uses 16 bit for Py_UNICODE. Note that
> this would only work for single characters, not for looping or indexing
> (without the PEP, that is). So it's somewhat inconsistent. It would work
> well for literals, though. Also, we'd have to support it for 'in' tests, as
> a Py_UCS4 value may simply not be in a Py_UNICODE buffer, even though the
> character is in the string.
>
> Comments?

No, I don't think we should handle surrogate pairs automatically, at
least not without making it optional. It could have a significant
performance impact with little benefit for most users. Using these
higher characters is rare, and using them on a non-UCS4 build is
probably even rarer. Also, this would be inconsistent with
Python-level slicing, indexing, and range, right?
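
For reference, the surrogate-pair split under discussion is the
standard UTF-16 arithmetic. A minimal sketch in plain Python follows;
the two function names are hypothetical and this is not a proposal for
how Cython would implement it.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Join a (high, low) surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1F600 splits into (0xD83D, 0xDE00). On a 16-bit build
those are two separate items for indexing, slicing and looping, which
is where the inconsistency comes from.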

- Robert
_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev
