On Thursday, 20 June 2013 at 19:17:12 UTC+2, MRAB wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
> > On Fri, Jun 21, 2013 at 2:27 AM, <wxjmfa...@gmail.com> wrote:
> >> And all these coding schemes have something in common:
> >> they all work with a unique set of code points, more
> >> precisely a unique set of encoded code points (not
> >> the set of implemented code points (bytes)).
> >>
> >> That is just what the flexible string representation is not
> >> doing: it artificially divides unicode into subsets and tries
> >> to handle each subset differently.
> >>
> > UTF-16 divides Unicode into two subsets: BMP characters (encoded
> > using one 16-bit unit) and astral characters (encoded using two
> > 16-bit units in the D800::/5 netblock, or equivalent thereof). Your
> > beloved narrow builds are guilty of exactly the same crime as the
> > hated 3.3.
> >
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!
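[The code-unit counts the two posters describe can be checked directly from Python 3; a quick sketch, with sample characters of my own choosing:]

```python
# How many code units each encoding spends on characters from
# different Unicode subsets (sample characters chosen for illustration).
samples = [
    ("a", "ASCII, U+0061"),
    ("\u00e9", "Latin-1 range, U+00E9"),
    ("\u20ac", "BMP, U+20AC"),
    ("\U00010348", "astral plane, U+10348"),
]
for ch, note in samples:
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per 16-bit unit
    print(f"{note}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

This prints 1, 2, 3, 4 UTF-8 bytes respectively, and shows the astral character alone needing two UTF-16 units (a surrogate pair).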
Sorry, but no, it does not work that way: you are confusing the set of
encoded code points with their implementation, the code units.

utf-8: how many bytes to hold an "a" in memory? One byte.

flexible string representation: how many bytes to hold an "a" in
memory? One byte? No, two. (Funny: it consumes more memory to hold an
ascii char than ascii itself does.)

utf-8: in a series of bytes implementing the encoded code points of a
string, picking a byte and finding which encoded code point it belongs
to is no problem.

flexible string representation: in a series of bytes implementing the
encoded code points of a string, picking a byte and finding which
encoded code point it belongs to is ... impossible! This is one of the
causes of the poor behaviour of this flexible string representation.

These are the basics of any coding scheme, unicode included.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list
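[The UTF-8 property appealed to above is self-synchronization: continuation bytes always match the bit pattern 10xxxxxx and lead bytes never do, so from any byte you can scan backwards to the start of its code point. A minimal sketch, with a helper name of my own invention:]

```python
def codepoint_start(buf: bytes, i: int) -> int:
    """Return the index where the code point containing buf[i] starts.

    Works because UTF-8 continuation bytes all match 0b10xxxxxx,
    while lead bytes never do (self-synchronization).
    """
    while buf[i] & 0xC0 == 0x80:  # 0b10xxxxxx -> continuation byte
        i -= 1
    return i

buf = "a\u20ac!".encode("utf-8")  # b'a\xe2\x82\xac!'
# Byte 2 sits in the middle of the 3-byte sequence for U+20AC:
print(codepoint_start(buf, 2))  # -> 1, where '\u20ac' begins
```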