Pardon me for this drive-by posting, but this thread smells a lot like this old thread (don't be afraid to read it all, there are some good points in there; not directed at you Martin, but at all readers/posters in this thread)...
http://mail.python.org/pipermail/python-3000/2006-September/003795.html

I'm not averse to faster and/or more memory efficient unicode representations (I would be quite happy with them, actually). I do see the usefulness of having non-utf-8 representations, and caching them is a good idea, though I wonder if that is a "good for Python itself to cache", or "good for the application to cache". The evil side of me says that we should just provide an API available in Python/C for "give me the representation of unicode string X using the 2byte/4byte code points", and have it just return the appropriate array.array() value (useful for passing to other APIs, or for those who need to do manual manipulation of code-points), or whatever structure is deemed to be appropriate. The less evil side of me says that going with what the PEP offers isn't a bad idea, and might just be a good idea. I'll defer my vote to Martin.

Regards,
 - Josiah

On Mon, Jan 24, 2011 at 12:17 PM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

> I have been thinking about Unicode representation for some time now.
> This was triggered, on the one hand, by discussions with Glyph Lefkowitz
> (who complained that his server app consumes too much memory), and Carl
> Friedrich Bolz (who profiled Python applications to determine that
> Unicode strings are among the top consumers of memory in Python).
> On the other hand, this was triggered by the discussion on supporting
> surrogates in the library better.
>
> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).
>
> You'll find the PEP at
>
> http://www.python.org/dev/peps/pep-0393/
>
> For convenience, I include it below.
>
> Regards,
> Martin
>
> PEP: 393
> Title: Flexible String Representation
> Version: $Revision: 88168 $
> Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $
> Author: Martin v. Löwis <mar...@v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 24-Jan-2010
> Python-Version: 3.3
> Post-History:
>
> Abstract
> ========
>
> The Unicode string type is changed to support multiple internal
> representations, depending on the character with the largest Unicode
> ordinal (1, 2, or 4 bytes). This will allow a space-efficient
> representation in common cases, but give access to full UCS-4 on all
> systems. For compatibility with existing APIs, several representations
> may exist in parallel; over time, this compatibility should be phased
> out.
>
> Rationale
> =========
>
> There are two classes of complaints about the current implementation
> of the unicode type: on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings).
> With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character,
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
> One problem with the approach is support for existing applications
> (e.g. extension modules). For compatibility, redundant representations
> may be computed. Applications are encouraged to phase out reliance on
> a specific internal representation if possible. As interaction with
> other libraries will often require some sort of internal
> representation, the specification chooses UTF-8 as the recommended way
> of exposing strings to C code.
>
> For many strings (e.g. ASCII), multiple representations may actually
> share memory (e.g. the shortest form may be shared with the UTF-8 form
> if all characters are ASCII). With such sharing, the overhead of
> compatibility representations is reduced.
>
> Specification
> =============
>
> The Unicode object structure is changed to this definition::
>
>     typedef struct {
>         PyObject_HEAD
>         Py_ssize_t length;
>         void *str;
>         Py_hash_t hash;
>         int state;
>         Py_ssize_t utf8_length;
>         void *utf8;
>         Py_ssize_t wstr_length;
>         void *wstr;
>     } PyUnicodeObject;
>
> These fields have the following interpretations:
>
> - length: number of code points in the string (result of sq_length)
> - str: shortest-form representation of the unicode string; the lower
>   two bits of the pointer indicate the specific form:
>   01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
>   00 => null pointer
>
>   The string is null-terminated (in its respective representation).
> - hash, state: same as in Python 3.2
> - utf8_length, utf8: UTF-8 representation (null-terminated)
> - wstr_length, wstr: representation in platform's wchar_t
>   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
>   pairs (in which case wstr_length differs from length).
>
> All three representations are optional, although the str form is
> considered the canonical representation which can be absent only
> while the string is being created.
>
> The Py_UNICODE type is still supported but deprecated. It is always
> defined as a typedef for wchar_t, so the wstr representation can double
> as Py_UNICODE representation.
>
> The str and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient). The str
> and wstr pointers point to the same memory if the string happens to
> fit exactly into the wchar_t type of the platform (i.e. uses some
> BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
> non-BMP characters if sizeof(wchar_t) is 4).
>
> If the string is created directly with the canonical representation
> (see below), this representation doesn't take a separate memory block,
> but is allocated right after the PyUnicodeObject struct.
>
> String Creation
> ---------------
>
> The recommended way to create a Unicode object is to use the function
> PyUnicode_New::
>
>     PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);
>
> Both parameters must denote the eventual size/range of the string.
> In particular, codecs using this API must compute both the number of
> characters and the maximum character in advance. A string is
> allocated according to the specified size and character range and is
> null-terminated; the actual characters in it may be uninitialized.
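To make the two-pass contract concrete, here is a rough sketch of how a codec might drive PyUnicode_New as described in the draft above. It is illustrative only: it uses the draft's own names (PyUnicode_New, plus the PyUnicode_Data accessor from the "String Access" section further down), so it is written against the proposal rather than any shipped header, and error handling is elided.

    /* Rough sketch only: a Latin-1 decoder against the draft API above.
     * Not the final spelling of anything; error handling is elided. */
    #include <Python.h>

    static PyObject *
    decode_latin1_sketch(const unsigned char *buf, Py_ssize_t len)
    {
        Py_UCS4 maxchar = 0;
        Py_ssize_t i;

        /* Pass 1: the creation API needs both the length and the largest
         * code point up front, so scan the input before allocating. */
        for (i = 0; i < len; i++) {
            if (buf[i] > maxchar)
                maxchar = buf[i];
        }

        /* Allocates the null-terminated canonical block; its characters
         * are still uninitialized at this point. */
        PyObject *u = PyUnicode_New(len, maxchar);
        if (u == NULL)
            return NULL;

        /* Pass 2: Latin-1 input never exceeds the 1-byte form, so the
         * data block can be filled as plain bytes. */
        unsigned char *data = (unsigned char *)PyUnicode_Data(u);
        for (i = 0; i < len; i++)
            data[i] = buf[i];

        return u;
    }

The point of the sketch is just the two-pass shape that PyUnicode_New imposes on codecs: size and maximum character first, actual characters second.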
>
> PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
> for processing UTF-8 input; the input is decoded, and the UTF-8
> representation is not yet set for the string.
>
> PyUnicode_FromUnicode remains supported but is deprecated. If the
> Py_UNICODE pointer is non-null, the str representation is set. If the
> pointer is NULL, a properly-sized wstr representation is allocated,
> which can be modified until PyUnicode_Finalize() is called (explicitly
> or implicitly). Resizing a Unicode string remains possible until it
> is finalized.
>
> PyUnicode_Finalize() converts a string containing only a wstr
> representation into the canonical representation. Unless wstr and str
> can share the memory, the wstr representation is discarded after the
> conversion.
>
> String Access
> -------------
>
> The canonical representation can be accessed using two macros,
> PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
> values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
> (3). PyUnicode_Data gives the void pointer to the data, masking out
> the pointer kind. All these functions call PyUnicode_Finalize
> in case the canonical representation hasn't been computed yet.
>
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible
> (which generates a new string object every time). API that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use this function to compute a conversion.
>
> PyUnicode_AsUnicode is deprecated; it computes the wstr representation
> on first use.
>
> String Operations
> -----------------
>
> Various convenience functions will be provided to deal with the
> canonical representation, in particular with respect to concatenation
> and slicing.
>
> Stable ABI
> ----------
>
> None of the functions in this PEP become part of the stable ABI.
>
> Copyright
> =========
>
> This document has been placed in the public domain.
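To round off the String Access section above, here is a small consumer-side sketch in the same spirit: again written against the draft names (PyUnicode_Kind, PyUnicode_Data, PyUnicode_1BYTE/PyUnicode_2BYTE, PyUnicode_AsUTF8) rather than any released header, with the 2-byte element type assumed here to be a plain uint16_t.

    /* Rough sketch only, against the draft API quoted above: branch once
     * on the kind, then index the data block with the matching width. */
    #include <Python.h>
    #include <stdint.h>

    static Py_UCS4
    codepoint_at_sketch(PyObject *u, Py_ssize_t i)
    {
        void *data = PyUnicode_Data(u);  /* may trigger PyUnicode_Finalize() */

        switch (PyUnicode_Kind(u)) {
        case PyUnicode_1BYTE:
            return ((unsigned char *)data)[i];    /* Latin-1 block */
        case PyUnicode_2BYTE:
            return ((uint16_t *)data)[i];         /* UCS-2 block   */
        default:
            return ((Py_UCS4 *)data)[i];          /* UCS-4 block   */
        }
    }

    /* When C libraries want bytes, the draft recommends UTF-8.  The
     * pointer form is cached on the string for its remaining lifetime;
     * the bytes-object form costs an allocation per call but nothing
     * persistent. */
    static void
    hand_to_c_library_sketch(PyObject *u)
    {
        const char *cached = PyUnicode_AsUTF8(u);         /* cached on object */
        PyObject *transient = PyUnicode_AsUTF8String(u);  /* fresh bytes      */

        /* ... pass either form to the external API here ... */

        Py_XDECREF(transient);
        (void)cached;
    }

The kind switch is the cost of the flexible representation: generic code pays one branch per access path, while pure-ASCII strings pay only one byte per character.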