Paul Rubin wrote:
> "Rhamphoryncus" <[EMAIL PROTECTED]> writes:
>> Indexing cost, memory efficiency, and canonical representation: pick
>> two. You can't use a canonical representation (scalar values) without
>> some sort of costly search when indexing (O(log n) probably) or by
>> expanding to the worst-case size (UTF-32). Python has taken the
>> approach of always providing efficient indexing (O(1)), but you can
>> compile it with either UTF-16 (better memory efficiency) or UTF-32
>> (canonical representation).
>
> I still don't get it. UTF-16 is just a data compression scheme, right?
> I mean, s[17] isn't the 17th character of the (unicode) string regardless
> of which memory byte it happens to live at? It could be that accessing
> it takes more than constant time, but that's hidden by the implementation.
>
> So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation), but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):

>>> s = u'\U0001d400'
>>> s.index(s)
0
>>> s[0]
u'\ud835'
>>> s == s[0]
False

In this case s[0] is not the full Unicode scalar value, but only the
first half of the surrogate pair that encodes U+1D400: the high
surrogate 0xD835 is in s[0] and the low surrogate 0xDC00 is in s[1].
(The eight-digit \U escape is needed here; \u takes exactly four hex
digits.)

--
If I have been able to see further, it was only because I stood
on the shoulders of giants.  -- Isaac Newton

Roel Schroeven
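
P.S. For anyone who wants to check the surrogate arithmetic by hand, here is a minimal sketch. It is written for Python 3, not the UTF-16 build discussed above, and the helper name to_surrogates is my own invention for illustration, not something from the standard library. It splits a code point above U+FFFF into the two 16-bit units that a UTF-16 build stores, then cross-checks the result against the utf-16-le codec.

```python
import struct


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the range U+10000..U+10FFFF")
    offset = cp - 0x10000              # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x1D400)
print(hex(high), hex(low))

# Cross-check: encode the same character as UTF-16 (little endian) and
# read the four bytes back as two unsigned 16-bit code units.
units = struct.unpack("<2H", "\U0001d400".encode("utf-16-le"))
assert units == (high, low)
```

On a build that indexes by 16-bit code unit, s[0] and s[1] would be exactly these two values, which is why s == s[0] comes out False there.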