Re: [Cython] Switching from Py_UNICODE to Py_UCS4

Stefan Behnel Sun, 06 Feb 2011 00:45:40 -0800

Robert Bradshaw, 04.02.2011 19:50:
> On Sat, Jan 29, 2011 at 2:35 AM, Stefan Behnel wrote:
>> Robert Bradshaw, 29.01.2011 10:01:
>>> On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel wrote:
>>>> there is a recent discussion on python-dev about a new memory layout for
>>>> the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's
>>>> serious ;)
>>>>
>>>> http://comments.gmane.org/gmane.comp.python.devel/120784
>>>
>>> That's an interesting PEP, I like it.
>>
>> Yep, after some discussion, I started liking it too. Even if it means I'll
>> have to touch a lot of code in Cython again. ;)
>>
>>
>>>> If nothing else, it gave me a new view on Py_UCS4 (basically a 32bit
>>>> unsigned int), which I had completely lost from sight. It's public and
>>>> undocumented and has been there basically forever, but it's a much nicer
>>>> type to support than Py_UNICODE, which changes size based on build time
>>>> options. Py_UCS4 is capable of representing any Unicode code point on any
>>>> platform.
>>>>
>>>> So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4
>>>> internally (without breaking user code which can continue to use either of
>>>> the two explicitly). This means that loops over unicode objects will infer
>>>> Py_UCS4 as loop variable, as would indexing. It would basically become the
>>>> native C type that 1 character unicode strings would coerce to and from.
>>>> Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the value
>>>> is too large in the given CPython runtime, as would write access to unicode
>>>> objects (in case anyone really does that) outside of the platform specific
>>>> Py_UNICODE value range. Writing to unicode buffers will be dangerous and
>>>> tricky anyway if the above PEP gets accepted.
>>>
>>> I am a bit concerned about the performance overhead of the Py_UCS4 to
>>> Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
>>> maybe that's both uncommon and negligible.
>>
>> I think so. If users deal with Py_UNICODE explicitly, they'll likely type
>> their respective variables anyway, so that there won't be an intermediate
>> step through Py_UCS4. And on 32bit Unicode builds this isn't an issue at all.


Coming back to this once more: if the PEP gets implemented, we will only 
know at C compile time (Py>=3.3 or not) if the result of indexing 
(including for-loop iteration) is Py_UCS4 or Py_UNICODE. For Cython's type 
inference, Py_UCS4 is therefore the more correct guess. So my proposal 
stands to always infer Py_UCS4 instead of Py_UNICODE for indexing, even if 
we ignore surrogate pairs in narrow Python builds.

I will implement this for now, so that we can see what it gives.


>>>> One open question that I see is whether we should handle surrogate pairs
>>>> automatically. They are basically a split of large Unicode code point
>>>> values (>65535) into two code points in specific ranges that are safe to
>>>> detect. So we could allow a 2 'character' surrogate pair in a unicode
>>>> string to coerce to one Py_UCS4 character and coerce that back into a
>>>> surrogate pair string if the runtime uses 16 bit for Py_UNICODE. Note that
>>>> this would only work for single characters, not for looping or indexing
>>>> (without the PEP, that is). So it's somewhat inconsistent. It would work
>>>> well for literals, though. Also, we'd have to support it for 'in' tests, as
>>>> a Py_UCS4 value may simply not be in a Py_UNICODE buffer, even though the
>>>> character is in the string.
>>>
>>> No, I don't think we should handle surrogate pairs automatically, at
>>> least without making it optional--this could be a significant
>>> performance impact with little benefit for most users. Using these
>>> higher characters is rare, but using them on a non-UCS4 build is
>>> probably even rarer.
>>
>> Well, basically they are the only way to use 'wide' Unicode characters on
>> 16bit Unicode builds.
>>
>> I think a unicode string of length 2 should be able to coerce into a
>> Py_UCS4 value at runtime instead of raising the current exception because
>> it's too long.
>
> Sure, that's fine by me.

This is now implemented for narrow builds.
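
To illustrate the coercion rule discussed above, here is a minimal sketch in plain Python. It is not the C code that Cython generates, and the function names are made up for this example. It models a narrow-build string as a sequence of 16-bit code units, and only assumes the standard UTF-16 surrogate ranges (0xD800-0xDBFF and 0xDC00-0xDFFF).

```python
def is_high_surrogate(unit):
    # First half of a surrogate pair.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Second half of a surrogate pair.
    return 0xDC00 <= unit <= 0xDFFF


def units_to_ucs4(units):
    """Coerce a 1-unit string, or a 2-unit surrogate pair, to one code point.

    Hypothetical helper: mirrors the idea that a unicode string of length 2
    may coerce to a single Py_UCS4 value on a narrow build.
    """
    if len(units) == 1:
        return units[0]
    if (len(units) == 2 and is_high_surrogate(units[0])
            and is_low_surrogate(units[1])):
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    raise ValueError("only single character unicode strings can be converted")


def ucs4_to_units(code_point):
    """Opposite direction: split a code point into 16-bit units if needed."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x10000:
        return (code_point,)
    offset = code_point - 0x10000
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))
```

For example, U+1F600 splits into the pair (0xD83D, 0xDE00), and that pair coerces back to 0x1F600. A 2-unit string that is not a valid surrogate pair is still rejected.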


>> For the opposite direction, integer to unicode string, you
>> already get a string of length 2 on narrow builds, that's how
>> unichr()/chr() work in Python 2/3. So, in a way, it's actually more
>> consistent with how narrow builds work today.
>
> OK.
>
>> The only reason this isn't
>> currently working in Cython is that Py_UNICODE is too small on narrow
>> builds to represent the larger Unicode code points. If we switched to
>> Py_UCS4, the problem would go away in narrow builds now and code could be
>> written today that would easily continue to work efficiently in a post-PEP
>> CPython as it wouldn't rely on the deprecated (and then inefficient)
>> Py_UNICODE type anymore.
>>
>> What about supporting surrogate pairs in 'in' tests only on narrow
>> platforms? I mean, we could simply duplicate the search code for that,
>> depending on how large the code point value really is at runtime. That code
>> will become a lot more involved anyway when the PEP gets implemented.
>
> Sure. This shouldn't add noticeable performance overhead for the
> simple case, and would be consistent with coercing to a 2-character
> Unicode string as above and then applying the Python 'in' operator.

Also implemented for narrow builds now, if the character type is Py_UCS4 
and not Py_UNICODE.
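
The duplicated search described above can be sketched as follows. This is plain Python that models a narrow-build buffer as a list of 16-bit code units. The helper names are hypothetical and this is not Cython's actual implementation.

```python
import struct


def utf16_units(text):
    # Model a narrow-build Py_UNICODE buffer: the string as 16-bit code units.
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def contains_code_point(units, code_point):
    """'in' test for one code point against a buffer of 16-bit units."""
    if code_point < 0x10000:
        # Simple case: the value fits into one unit, plain linear search.
        return code_point in units
    if code_point > 0x10FFFF:
        # Not a valid Unicode code point, so it cannot be in the string.
        return False
    # Large code point: search for its surrogate pair as two adjacent units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    for i in range(len(units) - 1):
        if units[i] == high and units[i + 1] == low:
            return True
    return False
```

Which branch is taken depends only on the runtime value of the code point, so the common case (a value below 0x10000) remains a single pass over the buffer.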


>>> Also, this would be inconsistent with
>>> python-level slicing, indexing, and range, right?
>>
>> Yes, it does not match well with slicing and indexing. That's the problem
>> with narrow builds in both CPython and Cython. Only the PEP can fix that by
>> basically dropping the restrictions of a narrow build.
>
> Let's let indexing do what indexing does.

Ok. So you'd continue to get whatever CPython returns for indexing, i.e. 
Py_UNICODE in Py<=3.2 and Py_UCS4 in Python versions that implement the 
PEP. That includes separate code points for surrogate pairs on narrow builds.
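
A small sketch of what "separate code points for surrogate pairs" means for indexing on a narrow build. It is again plain Python emulating 16-bit storage, with a hypothetical helper name. Current CPython versions no longer behave this way natively, so the narrow layout is simulated through UTF-16 encoding.

```python
import struct


def narrow_index(text, index):
    """Return the 16-bit unit that indexing would yield on a narrow build."""
    data = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return units[index]


s = "a\U0001F600"
# On a narrow build this string has length 3: 'a' plus two surrogate units.
first = narrow_index(s, 0)   # the code unit for 'a'
second = narrow_index(s, 1)  # high surrogate, not the full code point
third = narrow_index(s, 2)   # low surrogate
```

Here `second` and `third` are 0xD83D and 0xDE00. Neither is the code point U+1F600 itself, which is why indexing and single-character coercion are not fully consistent on narrow builds.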

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
