On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:

> On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>>
>>> If nobody had ever thought of doing a multi-format string
>>> representation, I could well imagine the Python core devs debating
>>> whether the cost of UTF-32 strings is worth the correctness and
>>> consistency improvements... and most likely concluding that narrow
>>> builds get abolished. And if any other language (eg ECMAScript)
>>> decides to move from UTF-16 to UTF-32, I would wholeheartedly
>>> support the move, even if it broke code to do so.
>>
>> Unfortunately, so long as most language designers are
>> European-centric, there is going to be a lot of push-back against
>> any attempt to fix (say) Javascript, or Java, just for the sake of
>> "a bunch of dead languages" in the SMPs. Thank goodness for emoji.
>> Wait til the young kids start complaining that their emoticons and
>> emoji are broken in Javascript, and eventually it will get fixed.
>> It may take a decade, for the young kids to grow up and take over
>> Javascript from the old codgers, but it will happen.
>
> I don't know that that'll happen like that. Emoticons aren't broken
> in Javascript - you can use them just fine. You only start seeing
> problems when you index into that string. People will start to
> wonder why, for instance, a "500 character maximum" field deducts
> two from the limit when an emoticon goes in.
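To make Chris's point concrete, here is a small Python sketch of what a
UTF-16-based language like Javascript actually counts. (utf16_units is
a made-up helper name; the utf-16-le buffer stands in for the
language's internal string representation.)

```python
def utf16_units(s):
    """Number of UTF-16 code units needed for s.

    This is what Javascript's String.length reports: code units,
    not code points.
    """
    # utf-16-le avoids the 2-byte BOM that plain "utf-16" prepends
    return len(s.encode("utf-16-le")) // 2

print(utf16_units("abc"))         # 3: all BMP, one unit each
print(utf16_units("\U0001F60A"))  # 2: an astral code point needs a
                                  #    surrogate pair
```

So a single emoji "spends" two units of a 500-unit limit, even though
it is one character as far as the user is concerned.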
I get that. I meant *Javascript developers*, not end-users. The young
kids today who become Javascript developers tomorrow will grow up in a
world where they expect to be able to write band names like "▼□■□■□■"
(yes, really, I didn't make that one up) and have it just work. Okay,
all those characters are in the BMP, but emoji aren't, and I guarantee
that even as we speak some new hipster band is trying to decide
whether to name themselves "Smiling 😢" or "Crying 😊". :-)

>> It is *possible* to have non-buggy string routines using UTF-16,
>> but the implementation is a lot more complex than most language
>> developers can be bothered with. I'm not aware of any language that
>> uses UTF-16 internally that doesn't give wrong results for
>> surrogate pairs.
>
> The problem isn't the underlying representation, the problem is what
> gets exposed to the application. Once you've decided to expose code
> points to the app (abstracting over your UTF-16 underlying
> representation), the change to using UTF-32, or mimicking PEP 393,
> or some other structure, is purely internal and an optimization. So
> I doubt any language will use UTF-16 internally and UTF-32 to the
> app. It'd be needlessly complex.

To be honest, I don't understand what you are trying to say.

What I'm trying to say is that it is possible to use UTF-16
internally, but *not* assume that every code point (character) is
represented by a single 2-byte unit. For example, the len() of a
UTF-16 string should not be calculated by counting the number of bytes
and dividing by two.
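A quick Python demonstration of why bytes-divided-by-two over-counts.
(This is just an illustration; the utf-16-le buffer plays the role of
a language's hypothetical internal string storage.)

```python
s = "Crying \U0001F60A"       # 8 code points, one of them astral
data = s.encode("utf-16-le")  # stand-in for the internal buffer

# The naive calculation counts the surrogate pair as two characters:
print(len(data) // 2)  # 9 -- wrong
print(len(s))          # 8 -- the true code point count
```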
You actually need to walk the string, inspecting each double-byte:

    # calculate length
    count = 0
    inside_surrogate = False
    for bb in buffer:  # get two bytes at a time
        if is_lower_surrogate(bb):
            inside_surrogate = True
            continue
        if is_upper_surrogate(bb):
            if inside_surrogate:
                count += 1
                inside_surrogate = False
                continue
            raise ValueError("missing lower surrogate")
        if inside_surrogate:
            break
        count += 1
    if inside_surrogate:
        raise ValueError("missing upper surrogate")

Given immutable strings, you could validate the string once, on
creation, and from then on assume it is well-formed:

    # calculate length, assuming the string is well-formed:
    count = 0
    skip = False
    for bb in buffer:  # get two bytes at a time
        if skip:
            # second half of a surrogate pair, already counted
            skip = False
            continue
        if is_surrogate(bb):
            skip = True
        count += 1

String operations such as slicing become much more complex once you
can no longer assume a 1:1 relationship between code points and code
units, whether they are 1, 2 or 4 bytes. Most (all?) language
developers don't handle that complexity, and push responsibility for
it back onto the coder using the language.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
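For what it's worth, code-point-aware slicing over a UTF-16 buffer can
be sketched along the same lines as the length calculation above.
(Python sketch; codepoint_slice and is_high_surrogate are illustrative
names, and the little-endian UTF-16 bytes again stand in for an
internal representation. Validation is assumed to have happened at
string creation.)

```python
import struct

def is_high_surrogate(unit):
    # leading surrogate of a pair: U+D800..U+DBFF
    return 0xD800 <= unit <= 0xDBFF

def codepoint_slice(buffer, start, stop):
    """Slice by code point index over little-endian UTF-16 bytes."""
    units = struct.unpack("<%dH" % (len(buffer) // 2), buffer)
    out = []
    i = cp = 0
    while i < len(units):
        # a code point spans one unit, or two if it starts a pair
        width = 2 if is_high_surrogate(units[i]) else 1
        if start <= cp < stop:
            out.extend(units[i:i + width])
        i += width
        cp += 1
    return struct.pack("<%dH" % len(out), *out)

buf = "a\U0001F60Ab".encode("utf-16-le")
print(codepoint_slice(buf, 1, 2).decode("utf-16-le"))  # the emoji,
                                                       # intact
```

Note the cost: slicing is now O(n) in the start index, because you
cannot jump straight to code point k without walking the units before
it.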