"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote:
> Nick Coghlan <[EMAIL PROTECTED]> writes:
> 
> > Only the first such call on a given string, though - the idea
> > is to use lazy decoding, not to avoid decoding altogether.
> > Most manipulations (len, indexing, slicing, concatenation, etc)
> > would require decoding to at least UCS-2 (or perhaps UCS-4).
> 
> Silently optimizing string recoding might change the way recoding
> errors are reported. i.e. they might not be reported at all even
> if the string is malformed. Optimizations which change the semantics
> are bad.

This is not a problem.  During construction of the string, you would
either be recoding the original bytes into the standard 'compressed'
format, or, if the input was already in that format, you would attempt
a decode anyway to validate it, and on failure report that the input
wasn't in the encoding originally specified.  Either way, malformed
input is caught at construction time, so no later lazy recoding can
change which errors get reported.
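
To make that concrete, here is a minimal sketch in Python (the helper
name is hypothetical, not a proposed API): validation happens eagerly
at construction, so the lazy machinery never has a new error to report
later.

    def make_string(raw, encoding):
        # Decode eagerly: a malformed byte sequence raises
        # UnicodeDecodeError here, at construction time, rather
        # than surfacing later from a silently-optimized recoding.
        return raw.decode(encoding)

    make_string('\xff\xfe', 'utf-8')  # fails at once: 0xff is invalid utf-8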


Personally though, I'm not terribly inclined to believe that using a
'compressed' representation of utf-8 is desirable.  Why not use latin-1
when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
isn't enough?  You get a fixed-width character representation, and aside
from the (annoying) need to write variants of each string function for
each width (macros would help here), or generic versions of each, you
never need to recode the string after it has been created.
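
A sketch of that selection rule (hypothetical helper, not an actual
CPython function): one pass to find the widest code point, then pick
the narrowest width that holds it.  The scan is O(n) once at creation;
every operation afterwards gets O(1) indexing.

    def char_width(s):
        # The widest code point determines the per-character size.
        widest = max(map(ord, s)) if s else 0
        if widest < 0x100:
            return 1    # latin-1: 1 byte per character
        elif widest < 0x10000:
            return 2    # ucs-2: 2 bytes per character
        else:
            return 4    # ucs-4: 4 bytes per character

    char_width(u'hello')       # -> 1
    char_width(u'\u0100')      # -> 2
    char_width(u'\U0001D11E')  # -> 4 (outside the BMP; ucs-4 build)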

Even better, with a slightly modified buffer interface, these characters
can be exposed to C extensions in a somewhat transparent manner (if
desired).
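
Illustrative only, using utf-16-le bytes and memoryview as stand-ins
for that modified buffer interface: a fixed-width string is just an
array of equal-sized items, which is a shape buffer consumers already
understand.

    s = u'abc\u0100'                   # widest char is U+0100: 2 bytes/char
    raw = s.encode('utf-16-le')        # fixed two-byte units for this string
    view = memoryview(raw).cast('H')   # expose as unsigned 16-bit items
    view.itemsize                      # -> 2 (bytes per character)
    view[3]                            # -> 0x0100, the fourth character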


 - Josiah
