Robert Bradshaw wrote:
> On Sep 30, 2009, at 10:08 AM, Matthew Honnibal wrote:
>> I've only just started using Cython today, and I'm having trouble with
>> the buffer interface indexing described here:
>> http://wiki.cython.org/enhancements/buffer . I want to iterate over a
>> unicode string getting contiguous subsequences.
>
> I'm not sure what the intended behavior is, given that unicode
> objects can be stored with various encodings, and, in particular, the
> default (UTF-8) is a variable-length encoding.

UTF-8 is not used here. However, the real encoding of the underlying byte
buffer is platform specific: it is either UCS-4 or UCS-2, in little-endian or
big-endian byte order, i.e. four possible encodings. I never tried, but I'd
expect the buffer view of unicode objects to point to that internal buffer.

Anyway, the fastest way to get to the data content of a unicode object is to
use the C-API functions PyUnicode_AS_DATA() and PyUnicode_GET_DATA_SIZE(), or,
if you prefer character-chunked data instead of byte data,
PyUnicode_AS_UNICODE() and PyUnicode_GET_SIZE().

Also see the definition of the Py_UNICODE character type, which can be a
2-byte or 4-byte integer type, or a wchar_t type, depending on the platform.

http://docs.python.org/c-api/unicode.html#Py_UNICODE

Stefan
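
To make the "four possible encodings" point concrete, here is a minimal Python sketch of how the same text would be laid out as raw bytes under each combination of code unit width and byte order. It does not read the interpreter's internal buffer; it only simulates the layouts with the standard codecs. The function names native_codec() and raw_units() are hypothetical helpers made up for this illustration, and the UTF-16 codec is used as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane).

```python
import sys


def native_codec(unit_size, byteorder=sys.byteorder):
    """Return a codec name matching a given code unit width and byte order.

    Hypothetical helper: unit_size is 2 (UCS-2, approximated by UTF-16)
    or 4 (UCS-4, i.e. UTF-32); byteorder is "little" or "big".
    """
    if unit_size not in (2, 4):
        raise ValueError("unit_size must be 2 or 4")
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    family = "utf-16" if unit_size == 2 else "utf-32"
    suffix = "le" if byteorder == "little" else "be"
    return family + "-" + suffix


def raw_units(text, unit_size, byteorder=sys.byteorder):
    """Encode text as fixed-width code units, without a byte order mark."""
    return text.encode(native_codec(unit_size, byteorder))


if __name__ == "__main__":
    # Show all four layouts for the same two-character string.
    for size in (2, 4):
        for order in ("little", "big"):
            print(size, order, raw_units("ab", size, order).hex())
```

Any code that indexes such a buffer byte-wise has to know both the unit width and the byte order, which is why working in character units (as with PyUnicode_AS_UNICODE() above) is usually the easier route.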
