Hi! I was directed here from the V8 discussion list, hope this is the right place to raise this.
I've read http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html and some of the related discussion (of which there is a considerable amount!). The problem with UTF-16 encodings has been biting me in a project where we allow untrusted users to configure our application by providing a script from which we call functions. The script manipulates text, so supporting full Unicode makes good sense, and compatibility with older ECMAScript engines/interpreters is not a significant concern for us. I'm fully aware that such compatibility is a major barrier to change in most situations, though; I am inclined toward some form of BRS as proposed by Brendan Eich.

Some worthwhile reading: http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

If the language provides a string type that's UTF-16 and then adds a few functions that count code points (as described on the Lindenberg page), the temptation will be strong for programmers to ignore non-BMP characters, and their code will quietly remain buggy in the face of surrogates. To truly support full Unicode, the language has to expose to its programmers *only* Unicode, not some encoding used to represent Unicode characters in memory. The easiest way to do this is to store strings as UTF-32, allowing O(1) indexing and so on, but that's really wasteful.

There is an alternative. Python (as of version 3.3) has implemented a new Flexible String Representation, aka PEP 393; the same has existed in Pike for some time. A string is stored in memory with a fixed number of bytes per character, chosen from the highest code point in that string: if there are any non-BMP characters, 4 bytes; if any in the range U+0100-U+FFFF, 2 bytes; otherwise 1 byte. This depends on strings being immutable (otherwise there would be an expensive string-copy operation whenever a too-large character is written in), which is true of ECMAScript.
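To make the surrogate hazard concrete, here's a small illustration (runnable in any current engine) of how UTF-16 code units leak through the string API, plus the manual surrogate pairing that correct code is forced to do today:

```javascript
// "\uD83D\uDCA9" is the surrogate-pair encoding of U+1F4A9,
// a single astral (non-BMP) character.
var s = "a\uD83D\uDCA9b";

// .length reports UTF-16 code units, not characters:
// s.length is 4, even though the string has 3 characters.

// Naive indexing can hand back a lone surrogate:
var piece = s.charAt(1); // "\uD83D" — not a valid character on its own

// Counting actual code points requires pairing surrogates by hand:
function codePointCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) i++; // skip the trail surrogate
    }
    count++;
  }
  return count;
}

codePointCount(s); // 3
```

This is exactly the kind of bookkeeping that gets skipped in practice, leaving code that works until the first astral character shows up.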
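The width-selection rule is simple enough to sketch. This is a hypothetical illustration (an engine would do this over its internal buffer at string-creation time, not over a JS array), assuming the code points of the new string are already known:

```javascript
// Hypothetical sketch of the flexible representation's width choice.
// Given the code points of an immutable string, pick the smallest
// fixed per-character width that can hold every one of them.
function bytesPerChar(codePoints) {
  var max = 0;
  for (var i = 0; i < codePoints.length; i++) {
    if (codePoints[i] > max) max = codePoints[i];
  }
  if (max > 0xFFFF) return 4; // astral characters present
  if (max > 0xFF) return 2;   // BMP beyond Latin-1
  return 1;                   // Latin-1 / ASCII only
}

bytesPerChar([0x68, 0x69]); // "hi"     -> 1
bytesPerChar([0x3C0]);      // "π"      -> 2
bytesPerChar([0x1F4A9]);    // U+1F4A9  -> 4
```

Because the width is fixed within any one string, indexing stays O(1) at every width; the scan happens once, when the string is built.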
Effectively, all strings are stored in UCS-4/UTF-32, but with the leading zero bytes elided when they're not needed. Most scripts are going to have a large number of pure-ASCII strings in them - variable names, identifiers, HTML tags, etc. These would benefit from a switch to Pike-strings, and any strings that don't actually have astral characters in them would suffer no penalty. Only strings that are actually affected need pay the price. And we could then trust that no surrogates ever get separated during transmission.

Chris Angelico

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

