> The nice thing about Py_UNICODE is that it basically gives you native
> Unicode code points directly, without needing to decode UTF-8 byte runs
> and the like. In Cython, it allows you to do things like this:
>
>     def test_for_those_characters(unicode s):
>         for c in s:
>             # warning: randomly chosen Unicode escapes ahead
>             if c in u"\u0356\u1012\u3359\u4567":
>                 return True
>         else:
>             return False
>
> The loop runs in plain C, using the somewhat obvious implementation with
> a loop over Py_UNICODE characters and a switch statement for the
> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
> strings.

And indeed, when Cython is updated for Python 3.3, it shouldn't access the
UTF-8 representation for such a loop. Instead, it should access the str
representation, and might compile this to code like

    #define Cython_CharAt(data, kind, pos) \
        ((kind) == LATIN1 ? ((unsigned char*)(data))[(pos)] : \
         (kind) == UCS2   ? ((unsigned short*)(data))[(pos)] : \
                            ((Py_UCS4*)(data))[(pos)])

    void *data = PyUnicode_Data(s);
    int kind = PyUnicode_Kind(s);
    for (int pos = 0; pos < PyUnicode_Size(s); pos++) {
        Py_UCS4 c = Cython_CharAt(data, kind, pos);
        /* the four code points from the Cython example */
        Py_UCS4 tmp[] = {0x356, 0x1012, 0x3359, 0x4567};
        for (int k = 0; k < 4; k++)
            if (c == tmp[k])
                return 1;
    }
    return 0;

> Regarding Cython specifically, the above will still be *possible* under
> the proposal, given that the memory layout of the strings will still
> represent the Unicode code points. It will just be trickier to implement
> in Cython's type system as there is no longer a (user visible) C type
> representation for those code units.

There is: Py_UCS4 remains available.

> It can be any of uchar, ushort16 or
> uint32, none of which is necessarily a 'native' representation of a
> Unicode character in CPython.

There won't be a "native" representation anymore - that's the whole
point of the PEP.

> While I'm somewhat confident that I'll
> find a way to fix this in Cython, my point is just that this adds a
> certain level of complexity to C code using the new memory layout that
> simply wasn't there before.

Understood. However, I think it is easier than you think it is.

Regards,
Martin
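
For anyone who wants to try the kind-dispatched read without a CPython build at hand, here is a minimal, self-contained C sketch of the same idea. It is only an illustration under stated assumptions: the names `sketch_kind`, `sketch_char_at` and `sketch_contains_target` are hypothetical and are not part of the CPython or Cython API, and plain arrays stand in for the string's internal buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-string storage kinds (not CPython names). */
enum sketch_kind { SKETCH_LATIN1 = 1, SKETCH_UCS2 = 2, SKETCH_UCS4 = 4 };

/* Read the code point at index pos from a buffer stored with the given kind. */
static uint32_t sketch_char_at(const void *data, enum sketch_kind kind, size_t pos)
{
    switch (kind) {
    case SKETCH_LATIN1: return ((const uint8_t *)data)[pos];
    case SKETCH_UCS2:   return ((const uint16_t *)data)[pos];
    case SKETCH_UCS4:   return ((const uint32_t *)data)[pos];
    }
    assert(!"unknown kind");
    return 0;
}

/* Return 1 if any code point in the buffer is one of the four targets, else 0. */
static int sketch_contains_target(const void *data, enum sketch_kind kind, size_t length)
{
    static const uint32_t targets[] = {0x356, 0x1012, 0x3359, 0x4567};
    for (size_t pos = 0; pos < length; pos++) {
        uint32_t c = sketch_char_at(data, kind, pos);
        for (size_t k = 0; k < sizeof targets / sizeof targets[0]; k++)
            if (c == targets[k])
                return 1;
    }
    return 0;
}
```

The caller always sees a 32-bit code point, whatever the storage width, so the comparison loop does not change with the kind; only the single read is dispatched.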