Robert Bradshaw, 29.01.2011 10:01:
> On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel wrote:
>> there is a recent discussion on python-dev about a new memory layout for
>> the unicode type in CPython 3.3(?), proposed by Martin von Löwis (so it's
>> serious ;)
>>
>> http://comments.gmane.org/gmane.comp.python.devel/120784
>
> That's an interesting PEP, I like it.

Yep, after some discussion, I started liking it too. Even if it means I'll 
have to touch a lot of code in Cython again. ;)


>> If nothing else, it gave me a new view on Py_UCS4 (basically a 32bit
>> unsigned int), which I had completely lost sight of. It's public and
>> undocumented and has been there basically forever, but it's a much nicer
>> type to support than Py_UNICODE, which changes size based on build time
>> options. Py_UCS4 is capable of representing any Unicode code point on any
>> platform.
>>
>> So, I'm proposing to switch from the current Py_UNICODE support to Py_UCS4
>> internally (without breaking user code, which can continue to use either of
>> the two explicitly). This means that loops over unicode objects will infer
>> Py_UCS4 as the loop variable type, as would indexing. It would basically
>> become the native C type that 1-character unicode strings coerce to and
>> from. Coercion from Py_UCS4 to Py_UNICODE would raise an exception if the
>> value is too large for the given CPython runtime, as would write access to
>> unicode objects (in case anyone really does that) outside of the
>> platform-specific Py_UNICODE value range. Writing to unicode buffers will
>> be dangerous and tricky anyway if the above PEP gets accepted.
>
> I am a bit concerned about the performance overhead of the Py_UCS4 to
> Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
> maybe that's both uncommon and negligible.

I think so. If users deal with Py_UNICODE explicitly, they'll likely type 
their respective variables anyway, so there won't be an intermediate step 
through Py_UCS4. And on 32bit Unicode builds, this isn't an issue at all.
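
For illustration, this is roughly what I have in mind (just a sketch, 
assuming the proposed inference and coercion are in place; the function 
names are made up):

    # the loop variable would infer to Py_UCS4, so this works the same
    # way on narrow and wide Unicode builds
    def count_non_bmp(unicode s):
        cdef Py_UCS4 c
        cdef Py_ssize_t n = 0
        for c in s:
            if c > 0xFFFF:
                n += 1
        return n

    def first_char(unicode s):
        # explicitly typed Py_UNICODE variables keep working as before,
        # without an intermediate step through Py_UCS4
        cdef Py_UNICODE u = s[0]
        return u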


>> One open question that I see is whether we should handle surrogate pairs
>> automatically. They are basically a split of large Unicode code point
>> values (>65535) into two code points in specific ranges that are safe to
>> detect. So we could allow a 2 'character' surrogate pair in a unicode
>> string to coerce to one Py_UCS4 character and coerce that back into a
>> surrogate pair string if the runtime uses 16 bits for Py_UNICODE. Note that
>> this would only work for single characters, not for looping or indexing
>> (without the PEP, that is). So it's somewhat inconsistent. It would work
>> well for literals, though. Also, we'd have to support it for 'in' tests, as
>> a Py_UCS4 value may simply not be in a Py_UNICODE buffer, even though the
>> character is in the string.
>
> No, I don't think we should handle surrogate pairs automatically, at
> least without making it optional--this could have a significant
> performance impact with little benefit for most users. Using these
> higher characters is rare, but using them on a non-UCS4 build is
> probably even rarer.

Well, basically they are the only way to use 'wide' Unicode characters on 
16bit Unicode builds.
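
The mapping itself is simple enough; here is the well-known UTF-16 split 
written out in Cython (the concrete code point is just an example value):

    # splitting a non-BMP code point into a UTF-16 surrogate pair, and back
    cdef Py_UCS4 v = 0x10384                              # any value > 0xFFFF
    cdef Py_UCS4 high = 0xD800 + ((v - 0x10000) >> 10)    # in 0xD800..0xDBFF
    cdef Py_UCS4 low  = 0xDC00 + ((v - 0x10000) & 0x3FF)  # in 0xDC00..0xDFFF
    # recombining gives back the original code point:
    assert v == 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)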

I think a unicode string of length 2 should be able to coerce to a Py_UCS4 
value at runtime, instead of raising an exception for being too long, as it 
does now. For the opposite direction, integer to unicode string, you 
already get a string of length 2 on narrow builds; that's how 
unichr()/chr() work in Python 2/3. So, in a way, it's actually more 
consistent with how narrow builds work today. The only reason this doesn't 
currently work in Cython is that Py_UNICODE is too small on narrow builds 
to represent the larger Unicode code points. If we switched to Py_UCS4, the 
problem would go away on narrow builds, and code could be written today 
that would easily continue to work efficiently in a post-PEP CPython, as it 
wouldn't rely on the deprecated (and then inefficient) Py_UNICODE type 
anymore.
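
Concretely, this is the kind of code I'd like to see work on narrow builds 
as well (a sketch of the intended behaviour, not what Cython does today):

    def roundtrip():
        # a single-code-point literal above 0xFFFF; a narrow build stores
        # it as a surrogate pair, i.e. a unicode string of length 2
        cdef Py_UCS4 c = u'\U00010384'  # would coerce instead of raising
        return c  # coerces back to a length-2 unicode string there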

What about supporting surrogate pairs in 'in' tests only on narrow 
platforms? I mean, we could simply duplicate the search code for that, 
depending on how large the code point value really is at runtime. That code 
will become a lot more involved anyway when the PEP gets implemented.
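
Something along these lines, with made-up helper names (the generated code 
would obviously be more involved):

    # 'in' test for a Py_UCS4 value 'c' on a narrow build (sketch)
    if c <= 0xFFFF:
        # fits into Py_UNICODE, use the normal single character search
        found = scan_single(s, c)
    else:
        # search for the two-code-unit surrogate pair instead
        found = scan_pair(s, 0xD800 + ((c - 0x10000) >> 10),
                             0xDC00 + ((c - 0x10000) & 0x3FF))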


> Also, this would be inconsistent with
> python-level slicing, indexing, and range, right?

Yes, it does not match well with slicing and indexing. That's the problem 
with narrow builds in both CPython and Cython. Only the PEP can fix that by 
basically dropping the restrictions of a narrow build.

Stefan