On 9/14/06, Josiah Carlson wrote:
>
> "Marcin 'Qrczak' Kowalczyk" wrote:
> > Nick Coghlan writes:
> >
> > > Only the first such call on a given string, though - the idea
> > > is to use lazy decoding, not to avoid decoding altogether.
> > > Most manipulations (len, indexing, slicing, concatenation, etc)
> > > would require decoding to at least UCS-2 (or perhaps UCS-4).
> >
> > Silently optimizing string recoding might change the way recoding
> > errors are reported. i.e. they might not be reported at all even
> > if the string is malformed. Optimizations which change the semantics
> > are bad.
>
> This is not a problem. During construction of the string, you would
> either be recoding the original string to the standard 'compressed'
> format, or if they had the same format, you would attempt a decoding,
> and on failure, claim that the input wasn't in the encoding originally
> specified.
>
>
> Personally though, I'm not terribly inclined to believe that using a
> 'compressed' representation of utf-8 is desirable. Why not use latin-1
> when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
> isn't enough? You get a fixed-width character encoding, and aside from
> the (annoying) need to write variants of each string function for each
> width (macros would help here), or generic versions of each, you never
> need to recode the initial string after it has been created.
>
> Even better, with a slightly modified buffer interface, these characters
> can be exposed to C extensions in a somewhat transparent manner (if
> desired).
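
For concreteness, here is a rough Python sketch of the scheme quoted above: decode strictly at construction time so malformed input is reported up front, then store the text in the narrowest fixed-width form that fits. The function name `narrowest_fixed_width` and the labels it returns are made up for this illustration and are not part of any proposal in this thread; the stdlib codecs `utf-16-le` and `utf-32-le` stand in for UCS-2 and UCS-4.

```python
def narrowest_fixed_width(raw, encoding="utf-8"):
    """Return (label, data) for the narrowest fixed-width form of `raw`.

    Hypothetical helper, for illustration only.
    """
    # Strict decoding: malformed input raises UnicodeDecodeError here,
    # at construction, rather than being silently deferred.
    text = raw.decode(encoding)

    # The highest code point decides how many bytes per character we need.
    highest = max(map(ord, text), default=0)

    if highest < 0x100:
        # Every character fits in one byte.
        return "latin-1", text.encode("latin-1")
    if highest < 0x10000:
        # Every character is in the BMP, so UTF-16 needs no surrogate
        # pairs and is exactly two bytes per character (i.e. UCS-2).
        return "ucs-2", text.encode("utf-16-le")
    # Otherwise fall back to four bytes per character.
    return "ucs-4", text.encode("utf-32-le")


if __name__ == "__main__":
    for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
        label, data = narrowest_fixed_width(sample.encode("utf-8"))
        print(repr(sample), label, len(data))
```

Once the width is chosen, indexing and slicing are plain offset arithmetic and the string never needs recoding. The cost is that each string operation needs a per-width variant, as noted in the quoted message.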

The argument for UTF-8 is probably interop efficiency. Lots of C
libraries, file formats, and wire protocols use UTF-8 for interchange.
Verifying the validity of UTF-8 during string creation isn't that big of
a deal.

-bob
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
