On Sat, Jan 29, 2011 at 2:35 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> Robert Bradshaw, 29.01.2011 10:01:
>> On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel wrote:
>>> there is a recent discussion on python-dev about a new memory layout
>>> for the unicode type in CPython 3.3(?), proposed by Martin von Löwis
>>> (so it's serious ;)
>>>
>>> http://comments.gmane.org/gmane.comp.python.devel/120784
>>
>> That's an interesting PEP, I like it.
>
> Yep, after some discussion, I started liking it too. Even if it means
> I'll have to touch a lot of code in Cython again. ;)
>
>>> If nothing else, it gave me a new view on Py_UCS4 (basically a 32-bit
>>> unsigned int), which I had completely lost from sight. It's public
>>> and undocumented and has been there basically forever, but it's a
>>> much nicer type to support than Py_UNICODE, which changes size based
>>> on build-time options. Py_UCS4 is capable of representing any Unicode
>>> code point on any platform.
>>>
>>> So, I'm proposing to switch from the current Py_UNICODE support to
>>> Py_UCS4 internally (without breaking user code, which can continue
>>> to use either of the two explicitly). This means that loops over
>>> unicode objects will infer Py_UCS4 as the loop variable type, as
>>> would indexing. It would basically become the native C type that
>>> one-character unicode strings coerce to and from. Coercion from
>>> Py_UCS4 to Py_UNICODE would raise an exception if the value is too
>>> large for the given CPython runtime, as would write access to
>>> unicode objects (in case anyone really does that) outside of the
>>> platform-specific Py_UNICODE value range. Writing to unicode buffers
>>> will be dangerous and tricky anyway if the above PEP gets accepted.
>>
>> I am a bit concerned about the performance overhead of the Py_UCS4 to
>> Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
>> maybe that's both uncommon and negligible.
>
> I think so. If users deal with Py_UNICODE explicitly, they'll likely
> type their respective variables anyway, so that there won't be an
> intermediate step through Py_UCS4. And on 32-bit Unicode builds this
> isn't an issue at all.
>
>>> One open question that I see is whether we should handle surrogate
>>> pairs automatically. They are basically a split of large Unicode
>>> code point values (>65535) into two code points in specific ranges
>>> that are safe to detect. So we could allow a 2-'character' surrogate
>>> pair in a unicode string to coerce to one Py_UCS4 character, and
>>> coerce that back into a surrogate pair string if the runtime uses
>>> 16 bits for Py_UNICODE. Note that this would only work for single
>>> characters, not for looping or indexing (without the PEP, that is).
>>> So it's somewhat inconsistent. It would work well for literals,
>>> though. Also, we'd have to support it for 'in' tests, as a Py_UCS4
>>> value may simply not be in a Py_UNICODE buffer, even though the
>>> character is in the string.
>>
>> No, I don't think we should handle surrogate pairs automatically, at
>> least without making it optional--this could be a significant
>> performance impact with little benefit for most users. Using these
>> higher characters is rare, but using them on a non-UCS4 build is
>> probably even rarer.
>
> Well, basically they are the only way to use 'wide' Unicode characters
> on 16-bit Unicode builds.
>
> I think a unicode string of length 2 should be able to coerce into a
> Py_UCS4 value at runtime instead of raising the current exception
> because it's too long.

Sure, that's fine by me.
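Roughly, I'd expect the narrow-build coercion to look something like
this (untested sketch; the helper name is made up, and the error
convention is just roughed in):

    #include <Python.h>

    /* Coerce a length-1 or length-2 unicode object to Py_UCS4 on a
     * narrow build, combining a surrogate pair into one code point. */
    static Py_UCS4 __pyx_unicode_to_ucs4(PyObject* u) {
        Py_ssize_t len = PyUnicode_GET_SIZE(u);
        Py_UNICODE* s = PyUnicode_AS_UNICODE(u);
        if (len == 1)
            return (Py_UCS4) s[0];
        if (len == 2 &&
                s[0] >= 0xD800 && s[0] <= 0xDBFF &&  /* high surrogate */
                s[1] >= 0xDC00 && s[1] <= 0xDFFF) {  /* low surrogate */
            return 0x10000
                + (((Py_UCS4) s[0] - 0xD800) << 10)
                + ((Py_UCS4) s[1] - 0xDC00);
        }
        PyErr_SetString(PyExc_ValueError,
            "only length-1 unicode strings (or surrogate pairs) "
            "can be coerced to Py_UCS4");
        return (Py_UCS4) -1;  /* error value; caller checks PyErr_Occurred() */
    }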
> For the opposite direction, integer to unicode string, you already get
> a string of length 2 on narrow builds, that's how unichr()/chr() work
> in Python 2/3. So, in a way, it's actually more consistent with how
> narrow builds work today.

OK.

> The only reason this isn't currently working in Cython is that
> Py_UNICODE is too small on narrow builds to represent the larger
> Unicode code points. If we switched to Py_UCS4, the problem would go
> away on narrow builds now, and code could be written today that would
> easily continue to work efficiently in a post-PEP CPython, as it
> wouldn't rely on the deprecated (and then inefficient) Py_UNICODE
> type anymore.
>
> What about supporting surrogate pairs in 'in' tests only on narrow
> platforms? I mean, we could simply duplicate the search code for that,
> depending on how large the code point value really is at runtime. That
> code will become a lot more involved anyway when the PEP gets
> implemented.

Sure. This shouldn't have non-negligible performance overhead for the
simple case, and it would be consistent with coercing to a 2-character
unicode string as above and then applying the Python 'in' operator.
(See the sketch in the P.S. below.)

>> Also, this would be inconsistent with python-level slicing, indexing,
>> and range, right?
>
> Yes, it does not match well with slicing and indexing. That's the
> problem with narrow builds in both CPython and Cython. Only the PEP
> can fix that by basically dropping the restrictions of a narrow build.

Let's let indexing do what indexing does.

- Robert
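P.S. For the narrow-build 'in' test, something like this is what I have
in mind (again an untested sketch with a made-up helper name):

    #include <Python.h>

    /* 'c in u' for a Py_UCS4 needle on a narrow (16-bit Py_UNICODE)
     * build: search for the matching surrogate pair when the code
     * point doesn't fit into a single Py_UNICODE unit. */
    static int __pyx_ucs4_in_unicode(Py_UCS4 c, PyObject* u) {
        Py_UNICODE* s = PyUnicode_AS_UNICODE(u);
        Py_ssize_t i, n = PyUnicode_GET_SIZE(u);
        if (c <= 0xFFFF) {
            /* simple case: the value fits into one code unit */
            for (i = 0; i < n; i++)
                if (s[i] == (Py_UNICODE) c)
                    return 1;
        } else {
            /* split the code point into a high/low surrogate pair */
            Py_UNICODE hi = (Py_UNICODE) (0xD800 + ((c - 0x10000) >> 10));
            Py_UNICODE lo = (Py_UNICODE) (0xDC00 + ((c - 0x10000) & 0x3FF));
            for (i = 0; i + 1 < n; i++)
                if (s[i] == hi && s[i+1] == lo)
                    return 1;
        }
        return 0;
    }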