Currently the PyUnicode object contains two caches: one for the UTF-8 representation and one for the wchar_t representation. They are needed not for optimization but to support C API functions that return borrowed references to these representations.

The UTF-8 cache has always been part of unicode objects (although in Python 2 it was not a UTF-8 cache but an 8-bit representation cache). Initially it was needed for compatibility with the 8-bit str, to implement the "s" and "z" format units in PyArg_Parse(). Now it is also used by PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the former Py_UNICODE representation. Py_UNICODE is now defined as an alias of wchar_t, and the C API that used to return a pointer to the Py_UNICODE content now returns a pointer to the cached wchar_t representation. This covers the "u" and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE() and PyUnicode_AS_DATA().

All this increases the size of the unicode object. The cost includes the constant overhead of the additional pointer and size fields, plus the overhead of the cached representation, which is proportional to the string length. The following table shows the number of bytes per character for the different kinds, with and without the specified caches filled.

       raw  +utf8     +wchar_t       +utf8+wchar_t
                   Windows  Linux   Windows  Linux
ASCII   1     1       3       5        3       5
UCS1    1    2-3      3       5       4-5     6-7
UCS2    2    3-5      2       6       3-5     7-9
UCS4    4    5-8     6-8      4       7-12    5-8
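
The arithmetic behind the table can be sketched in a few lines of Python. This is only a rough model, not CPython code: the function name bytes_per_char and its parameters are hypothetical, and it assumes that the UTF-8 cache shares the main buffer only for ASCII strings and that the wchar_t cache shares it only when the kind width equals sizeof(wchar_t). It ignores the constant header overhead and the trailing null terminators.

```python
def bytes_per_char(text, wchar_size, utf8_cache=False, wchar_cache=False):
    """Estimate character-data bytes per code point (hypothetical helper).

    wchar_size is sizeof(wchar_t): 2 on Windows, 4 on Linux.
    """
    if not text:
        raise ValueError("text must not be empty")
    max_cp = max(map(ord, text))
    # PEP 393 stores the string with the narrowest fixed width that fits
    # every code point: 1, 2 or 4 bytes per character.
    if max_cp < 0x100:
        kind = 1
    elif max_cp < 0x10000:
        kind = 2
    else:
        kind = 4
    total = kind * len(text)
    # Assumption: an ASCII string shares its buffer with the UTF-8 cache,
    # so only non-ASCII strings pay for a separate UTF-8 copy.
    if utf8_cache and max_cp >= 0x80:
        total += len(text.encode("utf-8"))
    # Assumption: the wchar_t cache shares the buffer when the widths match.
    if wchar_cache and kind != wchar_size:
        if wchar_size == 2:
            # UTF-16: non-BMP characters take a surrogate pair (4 bytes).
            total += len(text.encode("utf-16-le"))
        else:
            total += 4 * len(text)
    return total / len(text)


# One sample character per kind, with both caches filled.
for label, sample in [("ASCII", "a"), ("UCS1", "\xe9"),
                      ("UCS2", "\u20ac"), ("UCS4", "\U0001F600")]:
    print(label,
          bytes_per_char(sample, 2, utf8_cache=True, wchar_cache=True),
          bytes_per_char(sample, 4, utf8_cache=True, wchar_cache=True))
```

The ranges in the table come from the variable width of UTF-8 and UTF-16: a single sample character gives one point inside each range.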

There is also a newer C API, added in 3.3, for getting the wchar_t representation without returning a pointer into the cache: PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Currently their implementation nevertheless uses the cache, which has both benefits and disadvantages.
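
To illustrate the difference from the borrowed-pointer API, here is a minimal sketch that calls PyUnicode_AsWideCharString() through ctypes. It assumes CPython with ctypes available; the helper name wide_copy is made up for the example. The point is that the function hands back a newly allocated buffer that the caller must release with PyMem_Free(), so nothing has to stay attached to the string object afterwards.

```python
import ctypes

api = ctypes.pythonapi  # CPython only; calls are made with the GIL held

# wchar_t *PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
# Use a plain address so the same pointer can be passed to PyMem_Free().
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def wide_copy(text):
    """Round-trip text through a caller-owned wchar_t buffer (hypothetical helper)."""
    size = ctypes.c_ssize_t()
    buf = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    if not buf:
        raise MemoryError("PyUnicode_AsWideCharString() returned NULL")
    try:
        # size counts wchar_t units, not including the trailing null.
        return ctypes.wstring_at(buf, size.value)
    finally:
        # The buffer belongs to the caller, not to the string object.
        api.PyMem_Free(buf)


print(wide_copy("h\xe9llo"), ctypes.sizeof(ctypes.c_wchar))
```

Because the caller owns the copy, this API by itself gives no reason to keep a wchar_t cache alive in the object.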

The old Py_UNICODE-based API is deprecated and will eventually be removed.
I want to ask about the future of the wchar_t cache. Is the benefit of caching the wchar_t representation larger than the cost of the extra memory? The wchar_t representation is as natural for the Windows API as the UTF-8 representation is for the POSIX API. But in all other cases it is just a waste of memory. Are there reasons to keep the wchar_t cache after the deprecated API is removed?

I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). They were implemented via the old Py_UNICODE-based API; now they no longer use deprecated functions. They still use the wchar_t cache if it was created by an earlier call to the deprecated API, but they do not create it themselves. Is this the correct decision?

https://bugs.python.org/issue30863

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev