[Python-Dev] The future of the wchar_t cache

Serhiy Storchaka Sat, 20 Oct 2018 04:10:17 -0700

Currently the PyUnicode object contains two caches: one for the UTF-8 representation and one for the wchar_t representation. They are needed not for optimization but to support C API functions that return borrowed references to these representations.

The UTF-8 cache has always been present in unicode objects (although in Python 2 it was not a UTF-8 cache but an 8-bit representation cache). Initially it was needed for compatibility with 8-bit str, to implement the "s" and "z" format units in PyArg_Parse(). Now it is also used by PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the former Py_UNICODE representation. Py_UNICODE is now defined as an alias of wchar_t, and the C API that used to return a pointer to the Py_UNICODE content now returns a pointer to the cached wchar_t representation. This covers the "u" and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE() and PyUnicode_AS_DATA().

All this increases the size of the unicode object. The cost includes the constant overhead of the additional pointer and size fields, and the overhead of the cached representation, which is proportional to the string length. The following table gives the number of bytes per character for the different kinds, with and without the specified caches filled.


       raw  +utf8     +wchar_t       +utf8+wchar_t
                   Windows  Linux   Windows  Linux
ASCII   1     1       3       5        3       5
UCS1    1    2-3      3       5       4-5     6-7
UCS2    2    3-5      2       6       3-5     7-9
UCS4    4    5-8     6-8      4       7-12    5-8
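
The figures in the table can be reproduced with a little arithmetic. The following is only a rough sketch, not code from CPython: the helper name bytes_per_char and its parameters are made up for illustration, it ignores the fixed header overhead, and it assumes that the UTF-8 data is shared with the raw data for ASCII strings and that the wchar_t data is shared whenever sizeof(wchar_t) equals the kind size. A wchar_size of 2 stands for Windows and 4 for Linux.

```python
def bytes_per_char(text, wchar_size, utf8_cache=False, wchar_cache=False):
    """Approximate payload bytes per character of a non-empty str.

    Hypothetical helper for illustration only; fixed header fields
    are not counted.
    """
    n = len(text)
    top = max(map(ord, text))
    # Storage kind chosen by PEP 393: 1, 2 or 4 bytes per character.
    kind = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    total = kind * n
    # Assumption: an ASCII string needs no separate UTF-8 buffer.
    if utf8_cache and top >= 0x80:
        total += len(text.encode("utf-8"))
    # Assumption: no separate buffer when wchar_t matches the kind size.
    if wchar_cache and wchar_size != kind:
        if wchar_size == 2:
            # 16-bit wchar_t: non-BMP characters take a surrogate pair.
            total += len(text.encode("utf-16-le"))
        else:
            total += 4 * n
    return total / n


print(bytes_per_char("abc", 4, wchar_cache=True))
print(bytes_per_char("\u20ac", 4, utf8_cache=True, wchar_cache=True))
print(bytes_per_char("\U0001F600", 2, utf8_cache=True, wchar_cache=True))
```

The three calls correspond to the ASCII row with the wchar_t cache on Linux, the upper bound of the UCS2 row with both caches on Linux, and the upper bound of the UCS4 row with both caches on Windows.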

There is also a newer C API, added in 3.3, for getting the wchar_t representation without relying on the cache: PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Its current implementation nevertheless uses the cache, which has both benefits and disadvantages.
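
To illustrate the difference from the borrowed-pointer API, here is a minimal sketch of the caller-owned buffer pattern that PyUnicode_AsWideChar() offers. It is driven from Python through ctypes purely so that it can be run without compiling anything; real extension code would call the function directly from C. The wrapper name copy_to_wchar_buffer and the argtypes chosen below are assumptions of this sketch. It relies on the documented behaviour that passing a NULL buffer returns the required size, including the terminating null.

```python
import ctypes

# PyUnicode_AsWideChar(unicode, wchar_t *w, Py_ssize_t size) from the C API.
# The argtypes are this sketch's choice for marshalling through ctypes.
_as_wide = ctypes.pythonapi.PyUnicode_AsWideChar
_as_wide.argtypes = [ctypes.py_object, ctypes.c_void_p, ctypes.c_ssize_t]
_as_wide.restype = ctypes.c_ssize_t


def copy_to_wchar_buffer(text):
    """Copy text into a freshly allocated wchar_t buffer (hypothetical helper)."""
    # With a NULL buffer the function reports the size needed,
    # counted in wchar_t units and including the terminating null.
    needed = _as_wide(text, None, 0)
    buf = ctypes.create_unicode_buffer(needed)
    # The caller owns buf; nothing has to stay alive inside the str object.
    written = _as_wide(text, buf, needed)
    return buf, written


buf, written = copy_to_wchar_buffer("h\u00e9llo \u20ac")
print(buf.value == "h\u00e9llo \u20ac", written)
```

Because the result lives in memory owned by the caller, this style of API does not by itself require the string object to keep a cached wchar_t copy.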


The old Py_UNICODE-based API is deprecated and will eventually be removed.

I want to ask about the future of the wchar_t cache. Is the benefit of caching the wchar_t representation larger than the disadvantage of spending more memory? The wchar_t representation is as natural for the Windows API as the UTF-8 representation is for the POSIX API. But in all other cases it is just a waste of memory. Are there reasons to keep the wchar_t cache after the deprecated API is removed?

I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). They were implemented via the old Py_UNICODE-based API; now they no longer use deprecated functions. They still use the wchar_t cache if it was created by a previous use of the deprecated API, but do not create it themselves. Is this the correct decision?


https://bugs.python.org/issue30863

