Robert Bradshaw wrote:
> On Sep 30, 2009, at 10:08 AM, Matthew Honnibal wrote:
>> I've only just started using Cython today, and I'm having trouble with
>> the buffer interface indexing described here:
>> http://wiki.cython.org/enhancements/buffer . I want to iterate over a
>> unicode string getting contiguous subsequences.
> 
> I'm not sure what the intended behavior is, given that unicode  
> objects can be stored with various encodings, and, in particular, the  
> default (UTF-8) is a variable-length encoding.

UTF-8 is not used here. However, the real encoding of the underlying byte
buffer is platform specific: either UCS-4 or UCS-2, each in little-endian or
big-endian byte order, i.e. four possible encodings. I have never tried it,
but I'd expect the buffer view of unicode objects to point to that internal
buffer.
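
As a rough sketch of the four cases above, the following pure-Python
snippet guesses which codec name matches the internal layout. It assumes a
build where sys.maxunicode distinguishes 2-byte from 4-byte storage (on
Python 3.3 and later it is always 0x10FFFF and the internal representation
is different, so there the result is only illustrative). The function name
legacy_internal_encoding is hypothetical, and UTF-16/UTF-32 are used as
stand-ins for UCS-2/UCS-4.

```python
import sys


def legacy_internal_encoding(maxunicode=sys.maxunicode,
                             byteorder=sys.byteorder):
    """Guess a codec name matching the platform's Py_UNICODE buffer.

    Hypothetical helper: assumes 2-byte storage when maxunicode is
    0xFFFF and 4-byte storage otherwise.
    """
    width = "utf-16" if maxunicode == 0xFFFF else "utf-32"
    order = "-le" if byteorder == "little" else "-be"
    return width + order


if __name__ == "__main__":
    # The answer depends on the interpreter build and the CPU.
    print(legacy_internal_encoding())
```

Passing the two values explicitly, e.g.
legacy_internal_encoding(0xFFFF, "big"), makes it easy to check all four
combinations without access to each platform.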

Anyway, the fastest way to get at the data content of a unicode object is
through the C-API macros PyUnicode_AS_DATA() and PyUnicode_GET_DATA_SIZE(),
or, if you prefer character-chunked data instead of byte data,
PyUnicode_AS_UNICODE() and PyUnicode_GET_SIZE().
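
To illustrate the difference between byte data and character-chunked data,
here is a small pure-Python sketch. It does not call the C-API macros;
instead it assumes that encoding to UTF-16 or UTF-32 with an explicit byte
order approximates the layout of a 2-byte or 4-byte internal buffer. The
helper name code_units and its parameters are hypothetical.

```python
def code_units(text, unit_size=4, byteorder="little"):
    """Return the fixed-width code units of text as a list of integers.

    Hypothetical helper: unit_size is 2 (UCS-2-like) or 4 (UCS-4-like).
    """
    codec = {2: "utf-16", 4: "utf-32"}[unit_size]
    codec += "-le" if byteorder == "little" else "-be"
    data = text.encode(codec)  # the "byte data" view
    # Regroup the bytes into unit_size chunks: the "character" view.
    return [int.from_bytes(data[i:i + unit_size], byteorder)
            for i in range(0, len(data), unit_size)]


if __name__ == "__main__":
    units = code_units(u"abc", unit_size=2)
    # Contiguous subsequences are then plain slices of the unit list.
    print(units[0:2], units[1:3])
```

Note that with 2-byte units a character outside the Basic Multilingual
Plane occupies two units (a surrogate pair), so slicing by unit index can
split such a character.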

Also see the definition of the Py_UNICODE character type, which can be a
2-byte or 4-byte integer type, or a wchar_t type, depending on the platform.

http://docs.python.org/c-api/unicode.html#Py_UNICODE

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
