2011/5/16 Allen Wirfs-Brock <[email protected]>:
>
> On May 16, 2011, at 11:30 AM, Mike Samuel wrote:
>
>> 2011/5/16 Allen Wirfs-Brock <[email protected]>:
>>> I tried to post a pointer to this strawman on this list a few weeks ago, but
>>> apparently it didn't reach the list for some reason.
>>> Feedback would be appreciated:
>>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>>
>> Will this change the behavior of character groups in regular
>> expressions?  Would myString.match(/^.$/)[0].length ever be 2?
>> Would it ever match a supplemental codepoint?
>>
>
> No, supplemental codepoints are single string characters and RegExp matching
> operates on such characters.  A string could, of course, contain character
> sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings.
> However, from the perspective of Strings and RegExp those encodings would be
> multiple character sequences just like they are today.  The only ES functions
> currently proposed that would deal with multi-character encodings of
> supplemental codepoints are the URI handling functions.  However, it may be a
> good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions
> that simply do the encode/decode and don't have all the other processing
> involved in encodeURI/decodeURI.
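
To make the idea of a pure UTF-16 encode/decode step concrete, here is a minimal sketch. It is not part of the strawman: the names encodeUTF16 and decodeUTF16 and their array-based signatures are assumptions made for illustration. Because today's strings cannot hold a codepoint above \uffff as a single character, a full-Unicode string is modelled here as an array of codepoint numbers, and the encoded form as an array of 16-bit code units.

```javascript
// Hypothetical helpers; names and array-based signatures are assumptions.
// Input model: an array of Unicode codepoints (0..0x10FFFF).
// Output model: an array of 16-bit UTF-16 code units.
function encodeUTF16(codePoints) {
  var units = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp > 0xFFFF) {
      // Supplemental codepoint: split into a surrogate pair.
      var offset = cp - 0x10000;
      units.push(0xD800 + (offset >> 10));   // high (lead) surrogate
      units.push(0xDC00 + (offset & 0x3FF)); // low (trail) surrogate
    } else {
      units.push(cp); // BMP codepoint: one code unit
    }
  }
  return units;
}

function decodeUTF16(units) {
  var codePoints = [];
  for (var i = 0; i < units.length; i++) {
    var unit = units[i];
    var next = units[i + 1]; // undefined past the end, so isLow is false
    var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    var isLow = next >= 0xDC00 && next <= 0xDFFF;
    if (isHigh && isLow) {
      // Well-formed pair: recombine into one supplemental codepoint.
      codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
      i++; // skip the low surrogate just consumed
    } else {
      codePoints.push(unit); // BMP unit or unpaired surrogate, passed through
    }
  }
  return codePoints;
}
```

With this model, encodeUTF16([0x10000]) has length 2 while the input has length 1, which is the distinction between "one character" and "a two-character encoding of it" drawn above. Passing unpaired surrogates through unchanged is a design choice of this sketch, not something the strawman specifies.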

DOMString is defined at
http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus:

    Type Definition DOMString
        A DOMString is a sequence of 16-bit units.

So how would round-tripping a JS string through a DOM string work?

How would the following behave?

    var oneSupplemental = "\U00010000";
    alert(oneSupplemental.length);  // alerts 1
    var utf16Encoded = encodeUTF16(oneSupplemental);
    alert(utf16Encoded.length);  // alerts 2
    var textNode = document.createTextNode(utf16Encoded);
    alert(textNode.nodeValue.length);  // alerts ?

Does the DOM need to represent utf16Encoded internally so that it can
report 2 as the length on fetch of nodeValue?  If so, how can it
represent that for systems that use a UTF-16 internal representation
for DOMString?

>
>> How would the below, which replaces orphaned surrogates with U+FFFD
>> when strings are viewed as sequences of UTF-16 code units, behave?
>>
>> myString.replace(
>>     /[\ud800-\udbff](?![\udc00-\uffff])/g, "\ufffd")
>>   .replace(
>>     /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "\ufffd")
>
> Exactly as it currently does, assuming it was applied to a string that didn't
> contain any codepoints greater than \uffff.  If the string contained any
> codepoints > \uffff, those characters would not match the pattern and would
> not be replaced.
>
> The important thing to keep in mind here is that under this proposal, a
> supplemental codepoint is a single logical character.  For example, using a
> random character that isn't in the BMP:
>     "\u+02defc" === "\ud8ff\udefc";  // this is false
>     "\u+02defc".length === 1;  // this is true
>     "\ud8ff\udefc".length === 2;  // this is true
>
> Existing code that manipulates surrogate pairs continues to work unmodified
> because such code is explicitly manipulating pairs of characters.  However,
> such code might produce unexpected results if handed a string containing a
> codepoint > \uffff.  But that takes an explicit action by someone to
> introduce such an enhanced character into a string.
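
For reference, the intent of the orphaned-surrogate replacement can be sketched under today's semantics, where a string is a sequence of 16-bit code units. This is a variant, not the code quoted above: the function name replaceOrphanSurrogates is hypothetical, and it uses one regular expression with an alternation (well-formed pair first, lone surrogate second) rather than two chained replaces. It makes no claim about how the pattern would behave under the strawman.

```javascript
// Hypothetical helper; assumes current semantics (strings are sequences of
// 16-bit code units and the RegExp matches one code unit at a time).
function replaceOrphanSurrogates(s) {
  // First alternative consumes a well-formed high+low pair and keeps it.
  // Second alternative catches any surrogate that was not part of a pair.
  return s.replace(
    /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g,
    function (match) {
      return match.length === 2 ? match : "\ufffd";
    }
  );
}
```

For example, replaceOrphanSurrogates("a\ud800b") yields "a\ufffdb", while the well-formed pair "\ud800\udc00" is returned unchanged and still has length 2.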

>>> Allen

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

