On May 16, 2011, at 11:34 AM, Shawn Steele wrote:
> Thanks for making a strawman
(see my very last sentence below as it may impact the interpretation of some of
the rest of these responses)
>
> Unicode Escape Sequences
> Is it possible for U+ to accept either 4, 5, or 6 digit sequences?
> Typically when I encounter U+ notation the leading zero is omitted, and I see
> BMP characters quite often. Obviously BMP could use the U notation, however
> it seems like it’d be annoying to the occasional user to know that U is used
> for some and U+ for others. Seems like it’d be easier for developers to
> remember that U+ is “the new way” and U is “the old way that doesn’t always
> work”.
The ES string literal notation doesn't really accommodate variable-length
subtokens without explicit terminators. What would be the rules for parsing
"\u+12345678"? How do we know if the programmer meant "\u1234"+"5678" or
"\u0012"+"345678" or ...?
There have been past proposals for a syntax like \u{xxxxxx} that could have 1 to
6 hex digits. In the past proposal the assumption was that it would produce
UTF-16 surrogate pairs, but in this context we could adopt it instead of \u+ to
produce a single character. The disadvantage is that it is a slightly longer
sequence for actual large code points. On the other hand, perhaps it is more
readable? "\u+123456\u+123456" vs. "\u{123456}\u{123456}" ??
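For illustration, the ambiguity and the braced alternative can be sketched as
follows (the \u{...} escape shown here was later standardized, so it runs in
modern engines; the \u+ forms remain hypothetical):

```javascript
// The proposed "\u+12345678" is ambiguous without a terminator: it could
// mean "\u1234" followed by the literal text "5678", or "\u0012" followed
// by "345678", etc. The braced form removes the ambiguity by delimiting
// the digits explicitly, and accepts 1 to 6 hex digits.
const s = "\u{61}";      // one hex digit: LATIN SMALL LETTER A
const t = "\u{1D306}";   // five hex digits: TETRAGRAM FOR CENTRE

console.log(s);                              // "a"
console.log(t.codePointAt(0).toString(16));  // "1d306"
```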
>
> String Position
> It’s unclear to me if the string indices can be “changed” from UTF-16 to
> UTF-32 positions. Although UTF-32 indices are clearly desirable, I think
> that many implementations currently allow UTF-16 codepoints U+D800 through
> U+DFFF. In other words, I can already have Javascript strings with full
> Unicode range data in them. Existing applications would then have indices
> that pointed to the UTF-16, not UTF-32 index. Changing the definition of the
> index to UTF-32 would break those applications I think.
No, it wouldn't break anything, at least when applied to existing data. Your
existing code is explicitly doing UTF-16 processing. Somebody had to do the
processing to create the surrogate pairs in the string. As long as you use that
same agent you are still going to get UTF-16 encoded strings. Even though the
underlying character values could hold single characters with code points >
\uffff, the actual string won't unless somebody actually constructed the
string to contain such values. That presumably doesn't happen for existing
code.
The place where existing code might break is if somebody explicitly constructs
a string (using \u+ literals or String.fromCodepoint) that contains non-BMP
characters and passes it to routines that only expect 16-bit characters.
For this reason, any existing host routines that convert external data
resources to ES strings that contain surrogate pairs should probably continue
to do so. New routines should be provided that produce single characters
instead of pairs for non-BMP code points. However, the definition of such
routines is outside the scope of the ES specification.
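A sketch of the two representations, using today's String.fromCodePoint
(a later-standardized spelling of the String.fromCodepoint mentioned above);
in current UTF-16 engines both paths yield the same two-unit string:

```javascript
// A non-BMP code point encoded manually as a UTF-16 surrogate pair.
const cp = 0x1F600; // EMOJI: GRINNING FACE
const asPair = String.fromCharCode(
  0xD800 + ((cp - 0x10000) >> 10),   // high surrogate
  0xDC00 + ((cp - 0x10000) & 0x3FF)  // low surrogate
);

console.log(asPair.length);                       // 2 (two 16-bit units)
console.log(String.fromCodePoint(cp) === asPair); // true in UTF-16 engines
```

Under the strawman, a "single character" routine would instead produce a
one-element string holding the 21-bit value directly.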
Finally, note that just as current strings can contain 16-bit character values
that are not valid Unicode code points, the expanded full Unicode strings can
also contain 21-bit character values that are not valid Unicode code points.
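For example, in a UTF-16 engine today:

```javascript
// Strings can already hold 16-bit values that are not valid Unicode
// scalar values, e.g. an unpaired (lone) high surrogate.
const lone = "\uD800";

console.log(lone.length);                     // 1
console.log(lone.codePointAt(0).toString(16)); // "d800", an unpaired surrogate
```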
>
> You also touch on that with charCodeAt/codepointAt, which resolves the
> problem with the output type, but doesn’t address the problem with the
> indexing. Similar to the way you differentiated charCode/codepoint, it may
> be necessary to differentiate charCode/codepoint indices. IMO .fromCharCode
> doesn’t have this problem since it used to fail, but now works, which
> wouldn’t be breaking. Unless we’re concerned that now it can return a
> different UTF-16 length than before.
Again, nothing changes. Code that expects to deal with multi-character
encodings can still do so. What "magically" changes is that code that acts as
if Unicode code points are only 16 bits (i.e., code that doesn't correctly deal
with surrogate pairs) will now work with full 21-bit characters.
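The distinction can be sketched as follows, using string behavior in today's
UTF-16 engines (codePointAt was standardized later; here it stands in for the
strawman's codepointAt):

```javascript
// Naive code counts 16-bit units; code-point-aware code walks pairs.
const s = "A\u{1D306}B"; // the middle char is a surrogate pair today

console.log(s.length);   // 4: naive code sees four "characters"

// Surrogate-pair-aware iteration sees three code points:
let count = 0;
for (let i = 0; i < s.length; ) {
  const cp = s.codePointAt(i);
  count++;
  i += cp > 0xFFFF ? 2 : 1; // skip both halves of a surrogate pair
}
console.log(count);      // 3
```

With 21-bit strings as proposed, the naive loop and the aware loop would
agree, since each code point would occupy a single string element.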
>
> I don’t like the “21” in the name of decodeURI21.
Suggestions for better names are always welcome.
> Also, the “trick” I think, is encoding to surrogate pairs (illegally, since
> UTF8 doesn’t allow that) vs decoding to UTF16. It seems like decoding can
> safely detect input supplementary characters and properly decode them, or is
> there something about encoding that doesn’t make that state detectable?
I think I'm missing the distinction you are making between surrogate pairs and
UTF-16. I think I've been using the terms interchangeably; I may be munging
up the terminology.
>
> -Shawn
>
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Allen Wirfs-Brock
> Sent: Monday, May 16, 2011 11:12 AM
> To: [email protected]
> Subject: Full Unicode strings strawman
>
> I tried to post a pointer to this strawman on this list a few weeks ago, but
> apparently it didn't reach the list for some reason.
>
> Feed back would be appreciated:
>
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>
>
> Allen
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss