"Jason Orendorff" <[EMAIL PROTECTED]> wrote:
> 
> On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
> 
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
> 
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking about Latin-1, UCS-2, and UCS-4:

If I have a text object X whose internal representation is UCS-2, and
another text object Y whose internal representation is UCS-4, then I
know X != Y.  Why?  Because X and Y were each created with the minimal
width necessary to support the code points they contain.  Y must contain
a code point that doesn't fit in UCS-2, which X can't have, so X != Y.

For things like Y.startswith(X), where the widths differ, you actually
have to compare the code points.
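
To make that concrete, here is a rough sketch of the idea, not a proposal
for the actual implementation.  The Text class, min_width() and
code_points() are hypothetical names made up for this illustration; the
only assumption is that every object is stored at the narrowest of 1, 2
or 4 bytes per code point.

```python
def min_width(code_points):
    """Narrowest fixed width (in bytes) that can hold every code point."""
    top = max(code_points, default=0)
    if top < 0x100:
        return 1        # fits in Latin-1
    if top < 0x10000:
        return 2        # fits in UCS-2
    return 4            # needs UCS-4


class Text:
    """Hypothetical text object that always uses its minimal width."""

    def __init__(self, s):
        cps = [ord(c) for c in s]
        self.width = min_width(cps)
        # Fixed-width little-endian buffer, one unit per code point.
        self.data = b"".join(cp.to_bytes(self.width, "little") for cp in cps)

    def __len__(self):
        # O(1): no decoding needed for a fixed-width buffer.
        return len(self.data) // self.width

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        if self.width != other.width:
            # Different minimal widths imply different contents.
            return False
        return self.data == other.data

    def code_points(self):
        w = self.width
        return [int.from_bytes(self.data[i:i + w], "little")
                for i in range(0, len(self.data), w)]

    def startswith(self, prefix):
        # Widths may differ here, so compare code points, not raw bytes.
        p = prefix.code_points()
        return self.code_points()[:len(p)] == p


x = Text("h\u20ac")              # stored as UCS-2
y = Text("h\u20ac\U0001F600")    # stored as UCS-4
print(x == y, y.startswith(x))
```

Equality across widths never touches the buffers; only the prefix test
pays for a code-point comparison.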


 - Josiah

_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000