Serhiy Storchaka wrote on 20.10.2018 at 13:06:
> Currently the PyUnicode object contains two caches: for UTF-8
> representation and for wchar_t representation. They are needed not for
> optimization but for supporting C API which returns borrowed references for
> such representations.
>
> The UTF-8 cache always was in unicode objects (but in Python 2 it was not a
> UTF-8 cache, but an 8-bit representation cache). Initially it was needed for
> compatibility with 8-bit str, for implementing the "s" and "z" format units
> in PyArg_Parse(). Now it is used also for PyUnicode_AsUTF8() and
> PyUnicode_AsUTF8AndSize().
>
> The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the
> former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of
> wchar_t, and the C API which returned a pointer to Py_UNICODE content
> returns now a pointer to the cached wchar_t representation. It is the "u"
> and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(),
> PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(),
> PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA().
>
> All this increases the size of the unicode object. It includes the constant
> overhead of additional pointer and size fields, and the overhead of the
> cached representation proportional to the string length. The following
> table contains the number of bytes per character for different kinds, with
> and without filling the specified caches.
>
>                raw    +utf8      +wchar_t        +utf8+wchar_t
>                               Windows  Linux    Windows  Linux
> ASCII           1       1        3       5         3       5
> UCS1            1      2-3       3       5        4-5     6-7
> UCS2            2      3-5       2       6        3-5     7-9
> UCS4            4      5-8      6-8      4        7-12    5-8
>
> There is also a new C API added in 3.3 for getting wchar_t representation
> without using the cache: PyUnicode_AsWideChar() and
> PyUnicode_AsWideCharString(). Currently it uses the cache, this has
> benefits and disadvantages.
>
> Old Py_UNICODE based API is deprecated, and will be removed eventually.
> I want to ask about the future of the wchar_t cache. Is the benefit of
> caching the wchar_t representation larger than the disadvantage of spending
> more memory? The wchar_t representation is as natural for the Windows API as
> the UTF-8 representation is for the POSIX API. But in all other cases it is
> just a waste of memory. Are there reasons for keeping the wchar_t cache after
> removing the deprecated API?
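
For reference, the per-character figures in the quoted table can be reproduced with a small additive model. This is only a sketch under my own assumptions, not CPython code: the names KIND_SIZE, UTF8_EXTRA, wchar_extra and bytes_per_char are made up for illustration, and I am assuming a 2-byte wchar_t on Windows, a 4-byte wchar_t on Linux, and that a cached copy costs nothing when the raw data already has the requested layout.

```python
# Hypothetical model of the table above; these names are illustrative
# and do not correspond to anything in CPython.

# Bytes per character of the raw string data for each kind.
KIND_SIZE = {"ASCII": 1, "UCS1": 1, "UCS2": 2, "UCS4": 4}

# Extra bytes per character (min, max) for a separately stored UTF-8 copy.
# ASCII data is already valid UTF-8, so no extra copy is counted.
UTF8_EXTRA = {"ASCII": (0, 0), "UCS1": (1, 2), "UCS2": (1, 3), "UCS4": (1, 4)}


def wchar_extra(kind, wchar_size):
    """Extra bytes per character (min, max) for a cached wchar_t copy."""
    if KIND_SIZE[kind] == wchar_size:
        return (0, 0)  # raw data already has the wchar_t layout
    if kind == "UCS4" and wchar_size == 2:
        return (2, 4)  # non-BMP characters need a surrogate pair
    return (wchar_size, wchar_size)


def bytes_per_char(kind, utf8=False, wchar_size=None):
    """Return (min, max) bytes per character with the given caches filled."""
    low = high = KIND_SIZE[kind]
    if utf8:
        low += UTF8_EXTRA[kind][0]
        high += UTF8_EXTRA[kind][1]
    if wchar_size is not None:
        extra = wchar_extra(kind, wchar_size)
        low += extra[0]
        high += extra[1]
    return (low, high)


if __name__ == "__main__":
    # Assumed wchar_t widths: 2 bytes on Windows, 4 bytes on Linux.
    for kind in KIND_SIZE:
        print(
            kind,
            bytes_per_char(kind),
            bytes_per_char(kind, utf8=True),
            bytes_per_char(kind, wchar_size=2),
            bytes_per_char(kind, wchar_size=4),
            bytes_per_char(kind, utf8=True, wchar_size=2),
            bytes_per_char(kind, utf8=True, wchar_size=4),
        )
```

In this model the wchar_t cache is free only when the string kind already matches the platform's wchar_t width (UCS2 on Windows, UCS4 on Linux); in every other case it adds 2 to 4 bytes per character.
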

I'd be happy to get rid of it. But regarding the use under Windows, I
wonder if there's interest in keeping it as a special Windows-only feature,
e.g. to speed up the data exchange with the Win32 APIs. I guess it would
have to provide a visible (performance?) advantage to justify such special
casing over the code removal.

Stefan

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev