"Martin v. Löwis", 28.01.2011 22:49:
And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8
representation for such a loop. Instead, it should access the str
representation

Sure.


Regarding Cython specifically, the above will still be *possible* under
the proposal, given that the memory layout of the strings will still
represent the Unicode code points. It will just be trickier to implement
in Cython's type system as there is no longer a (user visible) C type
representation for those code units.

There is: Py_UCS4 remains available.

Thanks for that pointer. I had always thought that all "*UCS4*" names were platform-specific and had completely missed that type. It's a lot nicer than Py_UNICODE because it allows users to fold surrogate pairs back into a single character value.

It's completely missing from the docs, BTW. Google doesn't give me a single mention for all of docs.python.org, even though it has existed at least since (and likely long before) Python 2.3, Cython's oldest supported runtime.

If I had known about that type earlier, I would probably have made it the native Unicode character type in Cython instead of bothering with Py_UNICODE. But this can still be changed, I think. Since type inference was available before native Py_UNICODE support, it's unlikely that users will have Py_UNICODE written explicitly in their code, so I can make the switch under the hood.

Just to explain, a native CPython C type is much better than an arbitrary integer type because it allows Cython to apply specific coercion rules from and to Python object types. Like Py_UNICODE does today, Py_UCS4 would obviously coerce from and to a one-character Unicode string, but it could additionally handle surrogate pair splitting and combining automatically on current 16-bit Unicode builds, so that you'd get a Unicode string with two code units (a surrogate pair) on coercion to Python.
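
To make the surrogate pair splitting and combining concrete, here is a minimal sketch of the arithmetic in plain Python. It is not Cython's implementation; the function names split_surrogates and combine_surrogates are made up for illustration, and the only thing assumed is the standard UTF-16 encoding of code points above U+FFFF.

```python
from typing import Tuple


def split_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper, roughly what coercing a 32-bit character value
    to a string would have to do on a 16-bit Unicode build.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # lower 10 bits
    return high, low


def combine_surrogates(high: int, low: int) -> int:
    """Fold a (high, low) surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = split_surrogates(0x1D11E)
print(hex(high), hex(low))                  # 0xd834 0xdd1e
print(hex(combine_surrogates(high, low)))   # 0x1d11e
```

Code points up to U+FFFF need no splitting, which is why the sketch rejects them rather than returning a pair.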


While I'm somewhat confident that I'll
find a way to fix this in Cython, my point is just that this adds a
certain level of complexity to C code using the new memory layout that
simply wasn't there before.

Understood. However, I think it is easier than you think it is.

Let's see about the implications once there is an implementation.

Stefan
