Re: Full Unicode strings strawman

Mike Samuel Mon, 16 May 2011 12:28:59 -0700

2011/5/16 Allen Wirfs-Brock <[email protected]>:
>
> On May 16, 2011, at 11:30 AM, Mike Samuel wrote:
>
>> 2011/5/16 Allen Wirfs-Brock <[email protected]>:
>>> I tried to post a pointer to this strawman on this list a few weeks ago, but
>>> apparently it didn't reach the list for some reason.
>>> Feedback would be appreciated:
>>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>>
>> Will this change the behavior of character groups in regular
>> expressions?  Would myString.match(/^.$/)[0].length ever be 2?
>> Would it ever match a supplemental codepoint?
>>
>
> No, supplemental codepoints are single string characters and RegExp matching 
> operates on such characters.  A string could, of course, contain character 
> sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings.  
> However, from the perspective of Strings and RegExp those encodings would be 
> multiple character sequences just like they are today.  The only ES functions 
> currently proposed that would deal with multi-character encodings of 
> supplemental codepoints are the URI handling functions.  However, it may be a 
> good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions 
> that simply do the encode/decode and don't have all the other processing 
> involved in encodeURI/decodeURI.


DOMString is defined at
http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus

    Type Definition DOMString
    A DOMString is a sequence of 16-bit units.

so how would round-tripping a JS string through a DOMString work?

How would the following behave?

    var oneSupplemental = "\U00010000";
    alert(oneSupplemental.length);  //  alerts 1
    var utf16Encoded = encodeUTF16(oneSupplemental);
    alert(utf16Encoded.length);  //  alerts 2
    var textNode = document.createTextNode(utf16Encoded);
    alert(textNode.nodeValue.length);   // alerts ?

Does the DOM need to represent utf16Encoded internally so that it can
report 2 as the length on fetch of nodeValue?  If so, how can it
represent that for systems that use a UTF-16 internal representation
for DOMString?
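
To make the question concrete, here is a rough sketch of what I imagine encodeUTF16 doing, written against today's 16-bit strings. A current string cannot hold U+10000 as a single character, so this version takes an array of code point numbers rather than a string. That signature and the body are my assumptions; the strawman does not define this function.

```javascript
// Hypothetical sketch: the strawman does not define encodeUTF16.
// Takes an array of code point numbers and returns a string of
// UTF-16 code units, as today's engines represent them.
function encodeUTF16(codePoints) {
  var units = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp < 0 || cp > 0x10FFFF) {
      throw new RangeError("Not a Unicode code point: " + cp);
    }
    if (cp <= 0xFFFF) {
      // BMP code points map to a single code unit.
      units.push(String.fromCharCode(cp));
    } else {
      // Supplemental code points map to a high/low surrogate pair.
      var offset = cp - 0x10000;
      units.push(String.fromCharCode(
          0xD800 + (offset >> 10),
          0xDC00 + (offset & 0x3FF)));
    }
  }
  return units.join("");
}

var utf16Encoded = encodeUTF16([0x10000]);
// utf16Encoded.length is 2: the code units are \ud800 and \udc00.
```

A DOM implementation that stores DOMString as UTF-16 would see exactly those two units, which is why I am asking what it should report once the JS side can also hold U+10000 as one character.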

>
>> How would the code below behave?  It replaces orphaned surrogates with
>> U+FFFD when strings are viewed as sequences of UTF-16 code units.
>>
>> myString.replace( /[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
>>    .replace( /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "$1\ufffd")
>
> Exactly as it currently does, assuming it was applied to a string that didn't 
> contain any codepoints greater than \uffff.   If the string contained any 
> codepoints > \uffff, those characters would not match the pattern and so 
> would not be replaced.
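
For reference, here is that two-step replacement as a self-contained function that runs on today's 16-bit strings, with both low-surrogate ranges ending at \udfff and $1 keeping the code unit that precedes an orphaned low surrogate. The function name is mine, chosen only for illustration.

```javascript
// Replaces orphaned surrogates with U+FFFD, treating the string as a
// sequence of 16-bit code units (today's behavior).
// The name replaceOrphanedSurrogates is hypothetical.
function replaceOrphanedSurrogates(s) {
  return s
      // A high surrogate not followed by a low surrogate.
      .replace(/[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
      // A low surrogate not preceded by a high surrogate; $1 puts back
      // the preceding code unit captured by the first group.
      .replace(/(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "$1\ufffd");
}

var cleaned = replaceOrphanedSurrogates("a\ud800b");
// cleaned is "a\ufffdb"; a well-formed pair such as "\ud800\udc00"
// passes through unchanged.
```

One caveat with this pattern: because the second regex consumes the preceding code unit, a run of adjacent orphaned low surrogates is only partially replaced.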
>
> The important thing to keep in mind here is that under this proposal, a 
> supplemental codepoint is a single logical character.  For example, using a 
> random character that isn't in the BMP:
> "\u+02defc" === "\ud8ff\udefc";  // this is false
> "\u+02defc".length === 1;  // this is true
> "\ud8ff\udefc".length === 2;  // this is true
>
> Existing code that manipulates surrogate pairs continues to work unmodified 
> because such code is explicitly manipulating pairs of characters.  However, 
> such code might produce unexpected results if handed a string containing a 
> codepoint > \uffff .  But that takes an explicit action by someone to 
> introduce such an enhanced character into a string.
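
As an example of the kind of existing code in question, here is a minimal sketch of a helper that explicitly combines a surrogate pair into a code point. The helper name is hypothetical and it only assumes today's charCodeAt behavior on 16-bit code units.

```javascript
// Hypothetical helper, not part of any proposal: returns the code point
// that starts at index i of a string made of 16-bit code units.
function codePointFromUnits(s, i) {
  var hi = s.charCodeAt(i);
  if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
    var lo = s.charCodeAt(i + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // Combine the high and low surrogates into one code point.
      return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
    }
  }
  // Not the start of a well-formed pair: return the code unit itself.
  return hi;
}

var first = codePointFromUnits("\ud800\udc00", 0);
// first is 0x10000
```

Code like this depends on seeing the two surrogates as separate characters, so what it does with a string that already holds the code point as one character is what I would like spelled out.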
>
>
>
>>
>>
>>> Allen
>>> _______________________________________________
>>> es-discuss mailing list
>>> [email protected]
>>> https://mail.mozilla.org/listinfo/es-discuss
>>>
>>>
>
>