STINNER Victor <[email protected]> added the comment:
Various comments of the PEP 393 and your patch.
"For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out."
and
"For compatibility, redundant representations may be computed."
I never understood this statement: in most cases, PyUnicode_READY() replaces
the Py_UNICODE* (wchar_t*) representation with a compact representation.
PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do
reallocate a Py_UNICODE* string for a ready string, but I don't think that
this is a common use case.
PyUnicode_AS_UNICODE() & friends are usually only used to build strings, so
this issue should be documented in a section other than the Abstract, maybe in
a Limitations section.
So even if a third party module uses the legacy Unicode API, PEP 393 will
still optimize memory usage thanks to the implicit calls to PyUnicode_READY()
(done everywhere in the Python source code).
In the current code, the most common case where a string has two
representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode()
is used to encode arguments for the Windows Unicode API, and
PyUnicode_AsUnicode() keeps the result in the wstr attribute.
Note: there is also the utf8 attribute, which may contain a third
representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old
_PyUnicode_AsString()) is called.
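To illustrate the idea of one canonical buffer plus a lazily computed, cached
encoding (the role the utf8 attribute plays), here is a minimal sketch in
plain C. This is not the real PyUnicodeObject layout; the struct and function
names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mini-object: canonical Latin-1 data plus a lazily
   computed, cached UTF-8 buffer, mimicking the utf8 attribute. */
typedef struct {
    size_t length;
    unsigned char *data;   /* canonical representation */
    char *utf8;            /* cached, NULL until first requested */
} ministr;

/* Return the cached UTF-8 buffer, computing it on first use only. */
static const char *ministr_as_utf8(ministr *s)
{
    if (s->utf8 == NULL) {
        /* Latin-1 -> UTF-8: chars >= 0x80 need two bytes. */
        size_t i, n = 0;
        for (i = 0; i < s->length; i++)
            n += (s->data[i] < 0x80) ? 1 : 2;
        s->utf8 = malloc(n + 1);
        if (s->utf8 == NULL)
            return NULL;
        char *p = s->utf8;
        for (i = 0; i < s->length; i++) {
            unsigned char c = s->data[i];
            if (c < 0x80) {
                *p++ = (char)c;
            }
            else {
                *p++ = (char)(0xC0 | (c >> 6));
                *p++ = (char)(0x80 | (c & 0x3F));
            }
        }
        *p = '\0';
    }
    return s->utf8;   /* later calls return the same pointer */
}
```

The cost of the extra representation is only paid by callers that actually ask
for it, which is the trade-off discussed above.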
"Objects for which the maximum character is not given at creation time are
called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL,
length)."
They can also be created by PyUnicode_FromUnicode().
"Resizing a Unicode string remains possible until it is finalized, generally by
calling PyUnicode_READY."
I changed PyUnicode_Resize(): it is now *always* possible to resize a string.
The change was required because some decoders overallocate the string, and
then resize it after decoding the input.
The sentence can simply be removed.
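The overallocate-then-shrink pattern that motivates this change can be
sketched in plain C with an invented toy decoder (the real decoders resize a
Unicode object, not a malloc'ed buffer):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoder: overallocate the output for the worst case,
   decode, then shrink to the actual size -- the pattern that requires
   resizing to remain possible after creation. */
static char *decode_skip_dashes(const char *in, size_t in_len,
                                size_t *out_len)
{
    /* Worst case: every input byte is kept. */
    char *out = malloc(in_len + 1);
    if (out == NULL)
        return NULL;
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] != '-')
            out[n++] = in[i];
    }
    out[n] = '\0';
    /* Shrink the overallocation to the real size. */
    char *shrunk = realloc(out, n + 1);
    *out_len = n;
    return shrunk ? shrunk : out;
}
```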
+ + 000 => str is not initialized (data are in wstr)
+ + 001 => 1 byte (Latin-1)
+ + 010 => 2 byte (UCS-2)
+ + 100 => 4 byte (UCS-4)
+ + Other values are reserved at this time.
I don't like binary numbers; I would prefer decimal numbers here. Binary may
have been useful when we used bit masks, but we are now using the C "unsigned
int field:bit_size;" trick for a nicer API. With the new values, it is even
easier to remember them:
1 byte <=> kind=1
2 bytes <=> kind=2
4 bytes <=> kind=4
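The bit-field trick and the kind-equals-byte-size mapping can be sketched as
follows; the field names loosely follow the PEP, but this is an illustrative
sketch, not the actual PyASCIIObject state struct:

```c
#include <assert.h>

/* Sketch of the "unsigned int field:bit_size;" trick: with decimal
   kind values, the kind is itself the character size in bytes. */
typedef struct {
    unsigned int interned:2;
    unsigned int kind:3;     /* 0 = not ready, 1/2/4 = char size */
    unsigned int compact:1;
    unsigned int ascii:1;
    unsigned int ready:1;
} state_bits;

/* Character size in bytes is just the kind itself: no decoding of a
   binary code is needed. */
static unsigned int char_size(state_bits st)
{
    return st.kind;
}
```

With bit masks, reading the kind would require shifting and masking; with a
named bit field the compiler does that for us, so plain decimal values 1, 2
and 4 are the more readable choice.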
"[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString,
which is removed"
_PyUnicode_AsString() does still exist and is still heavily used (66 calls).
It is not documented as deprecated in What's New in Python 3.3 (but it is a
private function, so nobody uses it, right?).
"This section summarizes the API additions."
PyUnicode_IS_ASCII() is missing.
PyUnicode_CHARACTER_SIZE() has been removed (use kind directly).
UCS4 utility functions:
Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr}
have been removed.
"The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects: ...
... PyUnicode_WriteChar ...."
PyUnicode_WriteChar() allows modifying an immutable object, which is something
specific to CPython. Well, the function does now raise an error if the string
is no longer modifiable (e.g. more than one reference to the string, the hash
has already been computed, etc.), but I don't know if it should be added to
the stable ABI.
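The "still modifiable?" condition described above can be sketched in plain C.
This is a hypothetical simplification, not CPython's actual check (which also
looks at interning and other state):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string object: writing in place is only safe while
   nothing else can observe the change. */
typedef struct {
    long refcnt;
    long hash;        /* -1 means "not computed yet" */
    size_t length;
    unsigned char *data;
} mutstr;

/* Return 1 if the string may still be modified in place. */
static int mutstr_is_modifiable(const mutstr *s)
{
    return s->refcnt == 1 && s->hash == -1;
}

/* Write a character, failing (-1) once the string is shared or its
   hash has been computed. */
static int mutstr_write_char(mutstr *s, size_t i, unsigned char c)
{
    if (!mutstr_is_modifiable(s) || i >= s->length)
        return -1;
    s->data[i] = c;
    return 0;
}
```

The subtlety for the stable ABI is that this check depends on CPython
internals (reference counts, cached hashes), which is exactly why the function
feels implementation-specific.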
"PyUnicode_AsUnicodeAndSize"
This function was added to Python 3.3 and is directly deprecated. Why add a
function only to deprecate it? Weren't PyUnicode_AsUnicode() and
PyUnicode_GET_SIZE() enough?
"Deprecations, Removals, and Incompatibilities"
Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp
--
A very important point is not well explained: it is very important that a
("final") string is in its canonical representation. For example, it means
that a UCS2 string must contain at least one character greater than U+00FF.
Not only do some optimizations rely on the canonical representation, but so do
some core methods of the Unicode type.
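The invariant can be expressed as a small check in plain C (a sketch of the
idea, not the code in _PyUnicode_CheckConsistency()):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the canonical-form invariant: a 2-bytes-per-char (UCS2)
   buffer is only canonical if at least one character does not fit in
   the narrower 1-byte (Latin-1) form. */
static int ucs2_is_canonical(const unsigned short *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] > 0xFF)
            return 1;   /* needs 2 bytes: canonical as UCS2 */
    }
    return 0;   /* would fit in Latin-1: should be kind=1 */
}
```

Without this invariant, two equal strings could have different kinds, and any
fast path that compares kinds before comparing data would give wrong answers.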
I tried to list all the properties of Unicode objects in the definition of the
PyASCIIObject structure, and I implemented checks in
_PyUnicode_CheckConsistency(). This function is only available in debug mode.
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue13604>
_______________________________________