RE: Full Unicode strings strawman

Shawn Steele Tue, 17 May 2011 11:09:35 -0700

I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that 
surrogate pairs are permitted.  That would be very unlikely to break any 
existing code and would still allow representation of everything reasonable 
in Unicode.


That would enable the full Unicode range, and allow extending string literals 
and regular expressions for convenience with the U+10FFFF style notation (which 
would be equivalent to the surrogate pair).  The character code manipulation 
functions could be similarly augmented without breaking anything (and might 
not even need different names).
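
To make the "equivalent to the surrogate pair" point concrete, here is a rough 
sketch of the arithmetic such a notation would have to perform.  The function 
name toSurrogatePair is made up for illustration; it is not an existing or 
proposed API, only the standard UTF-16 encoding of a supplementary code point.

```javascript
// Sketch only: toSurrogatePair is a hypothetical helper, not a real API.
// It splits a supplementary code point (U+10000..U+10FFFF) into the two
// UTF-16 code units that represent it.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;       // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);     // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits -> low surrogate
  return [high, low];
}

// U+10FFFF is stored as two code units, so the string's length is 2.
const [highUnit, lowUnit] = toSurrogatePair(0x10FFFF);
const asString = String.fromCharCode(highUnit, lowUnit);
console.log(highUnit.toString(16), lowUnit.toString(16), asString.length); // dbff dfff 2
```

Existing code that treats the string as a sequence of 16-bit units keeps 
working unchanged, which is the compatibility argument above.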

You might want to qualify the UTF-16 as allowing, but strongly discouraging, 
lone surrogates for those people who didn't realize their binary data wasn't a 
string.
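
For anyone who wants to check whether their "string" is really well-formed 
UTF-16, a check along these lines would do.  This is a minimal sketch; 
hasLoneSurrogate is a name I am inventing here, not something in the language.

```javascript
// Sketch only: hasLoneSurrogate is a hypothetical helper.
// Returns true if the string contains a high surrogate that is not followed
// by a low surrogate, or a low surrogate that is not preceded by a high one.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\uD83D\uDE00")); // false
console.log(hasLoneSurrogate("abc\uD83D"));    // true
```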

The sole disadvantage would be that iterating through a string would require 
consideration of surrogates, same as today.  The same caution is also necessary 
to avoid splitting Ä (U+0041 U+0308) into its base letter A and its combining 
diaeresis.  I wouldn't be opposed to some sort of helper functions or classes 
that aided in walking strings, preferably with options to walk the graphemes 
(or whatever), not just the surrogate pairs.  FWIW: we have such a helper for 
surrogates in .NET and "nobody uses it".  The most common feedback is that it's 
not that helpful because it doesn't deal with the graphemes.
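
As a rough illustration of what a surrogate-aware walk looks like, and why it 
is not enough on its own, consider something like the following.  The name 
codePoints is hypothetical, and lone surrogates are passed through as-is, 
which is just one possible policy.

```javascript
// Sketch only: codePoints is a hypothetical helper.
// Walks a string by code point, combining valid surrogate pairs and passing
// any other code unit (including lone surrogates) through unchanged.
function codePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the low surrogate as well
        continue;
      }
    }
    out.push(unit);
  }
  return out;
}

// The surrogate pair comes back as one code point, but A + combining
// diaeresis is still two entries: this walk knows nothing about graphemes.
const hex = codePoints("A\u0308\uD83D\uDE00").map((cp) => cp.toString(16));
console.log(hex.join(" ")); // 41 308 1f600
```

Walking graphemes would additionally need the Unicode text segmentation 
rules, which is a much bigger table-driven job than the pair check above.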

- Shawn

[email protected]
Senior Software Design Engineer
Microsoft Windows

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
