Travis Oliphant <[EMAIL PROTECTED]> wrote:
> I think you are right.  In the discussions for unifying string/unicode I 
> really like the proposals that are leaning toward having a unicode 
> object be an immutable string of either ucs-1, ucs-2, or ucs-4 depending 
> on what is in the string.

Except that it's not going to happen.  The width of the unicode
representation is going to be fixed at compile time, generally utf-16 or
ucs-4.  I say utf-16 because the representation allows for surrogate
pairs, etc., but each value of the pair is considered a "character",
whereas (according to my potentially flawed memory of reading the spec)
ucs-2 doesn't allow for surrogates.
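
To make the surrogate point concrete, here is a minimal sketch of how one
code point above U+FFFF becomes two 16-bit values in utf-16.  The helper
name to_surrogates is made up for illustration; the arithmetic is the
standard utf-16 encoding rule.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) needs two 16-bit code units.
high, low = to_surrogates(0x1D11E)

# Cross-check against the codec: 4 bytes, i.e. two 16-bit units.
encoded = "\U0001D11E".encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]
```

In a build where each 16-bit value counts as a "character", indexing such
a string can hand back either half of the pair on its own.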

Note that I previously offered an overlay structure that could support
O(log n) time access to arbitrary full characters regardless of
encoding (utf-8, utf-16 or ucs-4) using O(log n) space, but it was
decided by Guido that Python should return a partial character (half of a
surrogate pair) rather than offer non-constant character access time.*
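
For illustration only, the general idea of an index layered over utf-16
code units might look like the sketch below.  This is not the structure
I proposed; the class and method names (SurrogateIndex, char_at) are
hypothetical, and this simple version keeps a flat sorted list whose
size is proportional to the number of surrogate pairs, so it does not
meet the O(log n) space bound.  Lookup is a binary search over that list.

```python
from bisect import bisect_left


def utf16_units(text):
    """Return the utf-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


class SurrogateIndex:
    """Hypothetical overlay: maps full-character indexes to code-unit offsets."""

    def __init__(self, units):
        self.units = units
        self.pair_chars = []   # character indexes stored as surrogate pairs (sorted)
        char_index = 0
        i = 0
        while i < len(units):
            is_high = 0xD800 <= units[i] <= 0xDBFF
            has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
            if is_high and has_low:
                self.pair_chars.append(char_index)
                i += 2
            else:
                i += 1
            char_index += 1
        self.length = char_index   # number of full characters

    def char_at(self, k):
        """Return the full character at character index k."""
        if not 0 <= k < self.length:
            raise IndexError(k)
        # Each pair before k shifts the code-unit offset by one extra unit.
        pairs_before = bisect_left(self.pair_chars, k)
        offset = k + pairs_before
        if pairs_before < len(self.pair_chars) and self.pair_chars[pairs_before] == k:
            high, low = self.units[offset], self.units[offset + 1]
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        return chr(self.units[offset])


text = "a\U0001D11Eb\U0001F600c"
index = SurrogateIndex(utf16_units(text))
chars = [index.char_at(k) for k in range(index.length)]
```

A string with no surrogate pairs leaves pair_chars empty, so the lookup
degenerates to plain constant-time indexing.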

 - Josiah

* As a side note, the space and time are really a function of how often
surrogates, or their equivalents in utf-8, etc., occur.  The worst case is
O(log n) for both, but the actual cost depends on the structure of the
occurrences of the non-constant character lengths.

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000