Travis Oliphant <[EMAIL PROTECTED]> wrote:
> I think you are right. In the discussions for unifying string/unicode I
> really like the proposals that are leaning toward having a unicode
> object be an immutable string of either ucs-1, ucs-2, or ucs-4 depending
> on what is in the string.

Except that it's not going to happen. The width of the unicode representation is going to be fixed at compile time, generally utf-16 or ucs-4. I say utf-16 because the representation allows for surrogate pairs, etc., but each value of the pair is considered a "character", whereas (according to my potentially flawed memory of reading the spec) ucs-2 doesn't allow for surrogates.

Note that I previously offered an overlay structure that could support O(log n) time access to arbitrary full characters regardless of encoding (utf-8, utf-16 or ucs-4) using O(log n) space, but it was decided by Guido that Python should return a partial character (half of a surrogate pair) rather than offer non-constant character access time.*

 - Josiah

* As a side note, the space and time are really a function of how often surrogates, or their equivalent in utf-8, etc., occur. The worst case is O(log n) for both, but the actual cost depends on the structure of occurrences of the non-constant character lengths.
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
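
To make the trade-off above concrete, here is a rough sketch of the difference between indexing utf-16 code units (constant time, but it can hand back half of a surrogate pair) and indexing full characters through a side index. This is not the overlay structure mentioned above, which is not reproduced in this message. It is a simpler stand-in that keeps a sorted list of the positions of surrogate pairs and uses binary search, so lookups take O(log k) time and the index takes O(k) space, where k is the number of surrogate pairs. The names `to_utf16_units`, `SurrogateIndex` and `char_at` are made up for illustration.

```python
from bisect import bisect_left


def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def _is_pair(units, i):
    """True if units[i] and units[i + 1] form a high/low surrogate pair."""
    return (i + 1 < len(units)
            and 0xD800 <= units[i] <= 0xDBFF
            and 0xDC00 <= units[i + 1] <= 0xDFFF)


class SurrogateIndex:
    """Hypothetical helper mapping character indexes to code unit offsets."""

    def __init__(self, units):
        self.units = units
        # Character indexes of everything stored as a surrogate pair,
        # in ascending order (one linear scan to build).
        self.pairs = []
        char_index = 0
        i = 0
        while i < len(units):
            if _is_pair(units, i):
                self.pairs.append(char_index)
                i += 2
            else:
                i += 1
            char_index += 1
        self.length = char_index

    def char_at(self, char_index):
        """Return the full character at char_index as a str."""
        if not 0 <= char_index < self.length:
            raise IndexError(char_index)
        # Every pair before this character shifts the unit offset by one.
        shift = bisect_left(self.pairs, char_index)
        start = char_index + shift
        if shift < len(self.pairs) and self.pairs[shift] == char_index:
            hi, lo = self.units[start], self.units[start + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[start])


# U+1D11E (musical G clef) lies outside the BMP, so utf-16 needs a pair.
units = to_utf16_units("a\U0001D11Eb")
index = SurrogateIndex(units)
print(hex(units[1]))           # code unit access: half of the pair
print(repr(index.char_at(1)))  # character access: the whole character
```

When the string contains no surrogate pairs the side list is empty and the lookup reduces to plain indexing, which is the sense in which the cost depends on how often the non-constant-length characters occur.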