"Jason Orendorff" <[EMAIL PROTECTED]> wrote:
>
> On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
>
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
>
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking about Latin-1, UCS-2, and
UCS-4: if I have a text object X whose internal representation is
UCS-2, and another text object Y whose internal representation is
UCS-4, then I know X != Y.

Why?  Because X and Y were created with the minimal width necessary to
support the code points they contain.  Y must therefore contain a code
point that X doesn't, so X != Y.  Only when one wants to do things like
Y.startswith(X) do you actually have to compare the code points.

 - Josiah

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
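
A rough sketch of the width argument above, in Python, for anyone who
wants to poke at it.  The names `min_width` and `Text` are hypothetical
and exist only for illustration; nothing here is a proposed API.  It
assumes every text object is stored at the narrowest of 1, 2, or 4
bytes per code point, so two objects with different widths can be
declared unequal without looking at their data.

```python
def min_width(s: str) -> int:
    """Bytes per code point of the narrowest fixed-width form that holds s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1  # fits in Latin-1
    if top < 0x10000:
        return 2  # fits in UCS-2
    return 4      # needs UCS-4


class Text:
    """Hypothetical text object stored at its minimal width."""

    def __init__(self, s: str):
        self.width = min_width(s)
        # Fixed-width little-endian buffer, one unit per code point.
        self.data = b"".join(ord(ch).to_bytes(self.width, "little") for ch in s)

    def __len__(self):
        return len(self.data) // self.width

    def code_points(self):
        w = self.width
        return (int.from_bytes(self.data[i:i + w], "little")
                for i in range(0, len(self.data), w))

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        if self.width != other.width:
            # Different minimal widths: the wider one holds a code point
            # the narrower one cannot, so no recoding is needed.
            return False
        return self.data == other.data

    def __hash__(self):
        return hash((self.width, self.data))

    def startswith(self, prefix: "Text") -> bool:
        # Widths may differ here, so compare code points, not raw bytes.
        if len(prefix) > len(self):
            return False
        return all(a == b for a, b in zip(self.code_points(), prefix.code_points()))


x = Text("caf\u20ac")              # stored at width 2 (UCS-2)
y = Text("caf\u20ac\U0001F600")    # stored at width 4 (UCS-4)
print(x == y, y.startswith(x))
```

Note that the shortcut only holds if the minimal-width invariant is
enforced at construction time; a string kept at a wider width than it
needs would break it.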
