Re: Full Unicode strings strawman
I have read the discussion so far, but would like to come back to the strawman itself, because I believe that it starts with a problem statement that is incorrect and is misleading the discussion. Correctly describing the current situation would help the discussion of possible changes, in particular their compatibility impact.

The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. [...] As currently defined, characters in this expanded character set cannot be used in the source code of ECMAScript programs and cannot be directly included in runtime ECMAScript string values."

My reading of the ECMAScript Language Specification, edition 5.1 (January 2011), is:

1) ECMAScript allows, but does not require, implementations to support the full Unicode character set.

2) ECMAScript allows source code of ECMAScript programs to contain characters from the full Unicode character set.

3) ECMAScript requires implementations to treat String values as sequences of UTF-16 code units, and defines key functionality based on an interpretation of String values as sequences of UTF-16 code units, not based on an interpretation as sequences of Unicode code points.

4) ECMAScript prohibits implementations from conforming to the Unicode standard with regard to case conversions.

The relevant text portions leading to these statements are:

1) Section 2, Conformance: "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form."
To interpret this, note that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters [1], and that the only difference between UCS-2 and UTF-16 is that UTF-16 supports supplementary characters while UCS-2 does not [2].

2) Section 6, Source Text: "ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters, and that the conversion requirement enables the use of supplementary characters represented as 4-byte UTF-8 characters in source text. As UTF-8 is now the most commonly used character encoding on the web [3], the 4-byte UTF-8 representation, not Unicode escape sequences, should be seen as the normal representation of supplementary characters in ECMAScript source text.

3) Section 6, Source Text: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16. [...] Throughout the rest of this document, the phrase 'code unit' and the word 'character' will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text."

Section 15.5.4.4, String.prototype.charCodeAt(pos): "Returns a Number (a nonnegative integer less than 2**16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String."

Section 15.5.5.1, length: "The number of characters in the String value represented by this String object."
I don't like that the specification redefines a commonly used term such as "character" to mean something quite different (a code unit), and hides that redefinition in a section on source text while applying it primarily to runtime behavior. But there it is: thanks to the redefinition, it's clear that charCodeAt() returns UTF-16 code units, and that the length property holds the number of UTF-16 code units in the string.

4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard, Version 6.0 [4]: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence."

References: [1]
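As a small illustration (mine, not part of the original message) of the code-unit semantics described above, here is how ES5's length and charCodeAt() see one supplementary character:

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
var clef = "\uD834\uDD1E";

// length counts UTF-16 code units, not Unicode code points:
console.log(clef.length); // 2

// charCodeAt() returns the individual code units of the pair:
console.log(clef.charCodeAt(0).toString(16)); // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (low surrogate)
```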
RE: Use cases for WeakMap
This is all a bit off topic but performance does matter and folks seem to be underestimating the wealth of community knowledge that exists in this area. Who underestimates? Sorry, this wasn't meant to slight anyone. I have spent a career standing on the shoulders of Allen and his colleagues. My respect should not be underestimated. Interesting pointer. -Rick

From: Brendan Eich [mailto:bren...@mozilla.com] Sent: Monday, May 16, 2011 6:44 PM To: Hudson, Rick Cc: Allen Wirfs-Brock; Oliver Hunt; Andreas Gal; es-discuss Subject: Re: Use cases for WeakMap

On May 16, 2011, at 2:46 PM, Hudson, Rick wrote: This is all a bit off topic but performance does matter and folks seem to be underestimating the wealth of community knowledge that exists in this area.

Who underestimates? A bunch of us are aware of all this. Allen certainly knows all about it, and we've been talking shop with him for years, long before he joined Mozilla :-P. I recall a conversation like this one about sparse hashcode implementation with Allen, Lars Thomas Hansen (then of Opera), and Graydon Hoare from four or five years ago... http://wiki.ecmascript.org/doku.php?id=proposals:hashcodes (check the history)

However, in this thread, the issue is not optimizing hashcode or other metadata sparsely associated with objects. That's a good thing; implementations should do it. Having the hashcode in the object wins, compared to having it (initially) in a side table, but who's counting? The issue under dispute was neither sparse hashcode nor sparse fish property association, where the property would be accessed by JS user code that referenced the containing object itself. Rather, it was whether a frozen object needed any hidden mutable state to be a key in a WeakMap. And since this state would be manipulated by the GC, it matters if it's in the object, since the GC would be touching more potentially randomly distributed memory, thrashing more cache.
So far as I can tell, there's no demonstrated need for this hidden-mutable key-in-weakmap object state. And it does seem that touching key objects unnecessarily will hurt weakmap-aware GC performance. But I may be underestimating... :-/ /be ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu wrote: On 5/16/11 4:38 PM, Wes Garland wrote: Two great things about strings composed of Unicode code points: ...

Even though this is a breaking change from ES-5, I support it whole-heartedly, but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF)

Those aren't code points at all. They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16. Definition D71, Unicode 6.0.

If you allow storage of such, then you're allowing mixing Unicode strings and something else (whatever the something else is), with most likely bad results.

I don't believe this is true. We are merely allowing storage of Unicode strings which cannot be converted into UTF-16. That allows us to maintain most of the existing String behaviour (arbitrary array of uint16), although overflowing like this would break:

    a = String.fromCharCode(str.charCodeAt(0) + 1)

when str[0] is 0xFFFF.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects. Users doing surrogate pair decomposition will probably find that their code just works

How, exactly?

    /** Untested and not rigorous */
    function unicode_strlen(validUnicodeString) {
      var length = 0;
      for (var i = 0; i < validUnicodeString.length; i++) {
        if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
            validUnicodeString.charCodeAt(i) <= 0xdc00)
          continue;
        length++;
      }
      return length;
    }

Code like this, which looks for surrogate pairs in valid Unicode strings, will simply not find them, instead only finding code points which are the same size as the code unit.
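For what it's worth, here is a sketch of my own (not from the thread) of why such surrogate-skipping code degrades gracefully: under today's UTF-16 semantics, skipping the high half of each pair already yields the code point count, and under the proposed code-point strings the same loop would find no surrogates and count every element once:

```javascript
// Count code points by skipping high surrogates (0xD800-0xDBFF),
// so each surrogate pair contributes 1 rather than 2.
function codePointCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var cu = str.charCodeAt(i);
    if (cu >= 0xD800 && cu <= 0xDBFF) continue; // high half of a pair
    count++;
  }
  return count;
}

var s = "a\uD834\uDD1Eb"; // 'a', U+1D11E (one surrogate pair), 'b'
console.log(s.length);          // 4 code units
console.log(codePointCount(s)); // 3 code points
```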
Users creating Strings with surrogate pairs will need to re-tool

Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between DOM strings and JS strings. I do not know if my hopes reflect reality.

but this is a small burden and these users will be at the upper strata of Unicode-foodom.

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

I don't think so. I bet if we could survey web developers across the industry (rather than just top-tier people who tend to participate in discussions like this one), we would find that the vast majority of them never bother handling non-BMP cases, and do not test non-BMP cases. Heck, I don't even know if a non-BMP character can be data-entered into an <input type="text" maxlength="1"> or not. (Do you? What happens?) I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

Not unless DOMString is changed or the interaction between the two very carefully defined in failure-proof ways.

Yes, I was dismayed to find out that DOMString defines UTF-16. We could get away with converting UTF-16 at the DOMString <=> JSString transition point. This might mean that it is possible that JSString => DOMString conversion would throw, as full Unicode Strings could contain code points which are not representable in UTF-16.

If we don't throw on invalid-in-UTF-16 code points, then round-tripping is lossy. If it does, that's silly.

It needed to specify _something_, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...

By this comment, I am inferring then that DOM and JS Strings share their backing store. From an API-cleanliness point of view, that's too bad. From an implementation POV, it makes sense.
Actually, it makes even more sense when I recall the discussion we had last week when you explained how external strings etc. work in SpiderMonkey/Gecko. Do all the browsers share JS/DOM String backing stores?

It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.

If you can do that without breaking web pages, great. If not, then we need to talk. ;)

There is no question in my mind that this proposal would break Unicode-aware JS. It is my belief that that doesn't matter if it accompanies other major, opt-in changes. Resolving DOM String <=> JS String interchange is a little trickier, but I think it can be managed if we can allow JS => DOM conversion to throw when high surrogate code points are encountered in the JS String. It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally.

Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102
Re: arrow syntax unnecessary and the idea that function is too long
On May 16, 2011, at 7:55 PM, Peter Michaux wrote: On Mon, May 9, 2011 at 6:02 PM, Brendan Eich bren...@mozilla.com wrote: Yes, and we could add call/cc to make (some) compiler writers even happier. But users would shoot off all their toes with this footgun, and some implementors would be hard-pressed to support it. The point is *not* to do any one change that maximizes benefits to some parties while harming others.

By the nature of their task and its complexity, compiler writers targeting JavaScript need JavaScript to have features that make it possible to generate efficient compiled code. Without the big features like call/cc there are many things that just cannot be compiled well enough... which ultimately means all the languages that compile to JavaScript are just thin sugar layers that really aren't even worth the bother. Those languages, like CoffeeScript, are obscure and known by only a few people.

Rails 3.1 is obscure and known by only a few people? Seriously, check your attitude. There are many languages in the world. Asserting that Python implemented via Skulpt is not worth the bother is insulting to people working on that project and using it. If you don't think it's worth the bother, feel free not to bother. Your opinion does not become an imperative to add arbitrary compiler-target wishlist items. Here's a list of languages that compile to JS: https://github.com/jashkenas/coffee-script/wiki/List-of-languages-that-compile-to-JS I'm sure it's not complete.

The goal of pleasing compiler writers should be to make it possible to compile existing languages like Perl, Ruby, Python and Scheme to JavaScript. These languages are the ones that people know and really want to use and target their compilers to JavaScript.

This is not a straight-up discussion. You ignore safety, never mind usability. Compiler writers want unsafe interfaces to machine-level abstractions. Should we expose them?
Certainly not, even though not exposing them hurts efforts to compile (not transpile, as you note) other languages to JS. Too bad -- the first order of business is JS as a source language. Being a better target for compilers is secondary. It is among the goals, but not super-ordinate. http://wiki.ecmascript.org/doku.php?id=harmony:harmony Compiler-writers don't seem to be having such a bad time of it, and we can proceed on a more concrete requirements proposal basis than taking absolute-sounding philosophical stances. /be
Re: Full Unicode strings strawman
On 5/17/11 10:40 AM, Wes Garland wrote: On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu Those aren't code points at all. They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16.

Nor with any other Unicode encoding, really. They don't represent, on their own, Unicode characters.

If you allow storage of such, then you're allowing mixing Unicode strings and something else (whatever the something else is), with most likely bad results.

I don't believe this is true. We are merely allowing storage of Unicode strings which cannot be converted into UTF-16.

No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all.

Users doing surrogate pair decomposition will probably find that their code just works

How, exactly?

    /** Untested and not rigorous */
    function unicode_strlen(validUnicodeString) {
      var length = 0;
      for (var i = 0; i < validUnicodeString.length; i++) {
        if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
            validUnicodeString.charCodeAt(i) <= 0xdc00)
          continue;
        length++;
      }
      return length;
    }

Code like this, which looks for surrogate pairs in valid Unicode strings, will simply not find them, instead only finding code points which are the same size as the code unit.

Right, so if it's looking for non-BMP characters in the string, say, instead of computing the length, it won't find them. How the heck is that "just works"?

Users creating Strings with surrogate pairs will need to re-tool

Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between DOM strings and JS strings.

A number of web browsers have an interface between DOM and JS strings that consists of either memcpy or addref of the buffer.

I do not know if my hopes reflect reality.

They probably do, so you're only really talking about at least 10 different places across at least 5 different codebases that have to be fixed, in a coordinated way...
You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text. I don't think so. I bet if we could survey web developers across the industry (rather than just top-tier people who tend to participate in discussions like this one), we would find that the vast majority of them never bother handling non-BMP cases, and do not test non-BMP cases.

And how many of them use libraries that handle that for them? And how many implicitly rely on DOM-to-JS roundtripping without explicitly doing anything with non-BMP chars or surrogate pairs?

Heck, I don't even know if a non-BMP character can be data-entered into an <input type="text" maxlength="1"> or not. (Do you? What happens?)

It cannot in Gecko, as I recall; there maxlength is interpreted as the number of UTF-16 code units. In WebKit, maxlength is interpreted as the number of grapheme clusters, based on my look at their code just now. I don't know offhand about Presto and Trident, for obvious reasons.

We could get away with converting UTF-16 at the DOMString <=> JSString transition point.

What would that even mean? DOMString is defined to be an ES string in the ES binding right now. Is the proposal to have some other kind of object for DOMString (so that, for example, String.prototype would no longer affect the behavior of DOMString the way it does now)?

This might mean that it is possible that JSString => DOMString conversion would throw, as full Unicode Strings could contain code points which are not representable in UTF-16.

How is that different from sticking non-UTF-16 into an ES string right now?

If we don't throw on invalid-in-UTF-16 code points, then round-tripping is lossy. If it does, that's silly.

So both options suck, yes? ;)

It needed to specify _something_, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...

By this comment, I am inferring then that DOM and JS Strings share their backing store.
That's not what the comment was about, actually. The comment was about API. But yes, in many cases they do share backing store.

Do all the browsers share JS/DOM String backing stores?

Gecko does in some cases. WebKit+JSC does in all cases, I believe (or at least a large majority of cases). I don't know about others.

There is no question in my mind that this proposal would break Unicode-aware JS.

As far as I can tell it would also break Unicode-unaware JS.

It is my belief that that doesn't matter if it accompanies other major, opt-in changes.

If it's opt-in, perhaps.

Resolving DOM String <=> JS String interchange is a little trickier, but I think it can be managed if we can allow JS => DOM conversion to throw when high surrogate code points are encountered in the JS String.

I'm 99% sure this would break sites.

It might mean
Re: Full Unicode strings strawman
On May 16, 2011, at 8:13 PM, Allen Wirfs-Brock wrote: I think it does. In another reply I also mentioned the possibility of tagging in a JS-visible manner strings that have gone through a known encoding process.

Saw that, seems helpful. Want to spec it?

If the strings you are combining from different sources have not been canonicalized to a common encoding then you had better be damn careful how you combine them.

Programmers miss this as you note, so arguably things are not much worse, at best no worse, with your proposal. Your strawman does change the game, though, hence the global or cross-cutting (non-modular) concern. I'm warm to it, after digesting. It's about time we got past the '90s!

The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid encodings that Boris and others have pointed out). I don't know about other sources such as XMLHttpRequest or the file APIs. However, in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. You can hide such things from many programmers but not all. After all, people actually have to implement transcoders.

Transcoding to some canonical Unicode representation is often done by the browser upstream of JS, and that's a good thing. Declarative specification by authors, implementation by the relatively few browser i18n gurus, sparing the many JS devs the need to worry. This is good, I claim. That it means JS hackers are careless about Unicode is inevitable, and there are other reasons for that condition anyway. At least with your strawman there will be full Unicode flowing through JS and back into the DOM and layout.

/be
Re: Full Unicode strings strawman
On 5/17/11 1:05 PM, Brendan Eich wrote: If the strings you are combining from different sources have not been canonicalized to a common encoding then you had better be damn careful how you combine them. Programmers miss this as you note, so arguably things are not much worse, at best no worse, with your proposal.

Right now, by the time a string gets into JS in browsers it's been canonicalized into UTF-16, to the best of the browser's ability, unless you explicitly tell it otherwise (e.g. with the user-defined charset hackery on XHR).

The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid encodings that Boris and others have pointed out). I don't know about other sources such as XMLHttpRequest or the file APIs. However, in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. You can hide such things from many programmers but not all. After all, people actually have to implement transcoders. Transcoding to some canonical Unicode representation is often done by the browser upstream of JS, and that's a good thing. Declarative specification by authors, implementation by the relatively few browser i18n gurus, sparing the many JS devs the need to worry. This is good, I claim.

Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that "in the long run JS in the browser is going to have to be able to deal with arbitrary encodings". Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter.

That it means JS hackers are careless about Unicode is inevitable, and there are other reasons for that condition anyway. At least with your strawman there will be full Unicode flowing through JS and back into the DOM and layout.

See, this is the part I don't follow. What do you mean by "full Unicode" and how do you envision it flowing?
-Boris
Re: Full Unicode strings strawman
On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter. Allen said be able to, not forcing. Big difference. I think we three at least are in agreement here. That it means JS hackers are careless about Unicode is inevitable, and there are other reasons for that condition anyway. At least with your strawman there will be full Unicode flowing through JS and back into the DOM and layout. See, this is the part I don't follow. What do you mean by full Unicode and how do you envision it flowing? I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough. With Allen's proposal we'll finally have some new APIs for JS developers to use. /be
Re: Full Unicode strings strawman
On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter. Allen said "be able to", not "forcing". Big difference. I think we three at least are in agreement here. I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to "forcing" spectrum this strawman falls. See, this is the part I don't follow. What do you mean by full Unicode and how do you envision it flowing? I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough. With Allen's proposal we'll finally have some new APIs for JS developers to use. That doesn't answer my questions. -Boris
Re: Full Unicode strings strawman
On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote: On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter. Allen said be able to, not forcing. Big difference. I think we three at least are in agreement here. I think we're in agreement on the sentiment, but perhaps not on where on the able to to forcing spectrum this strawman falls. Where do you read forcing? Not in the words you cited. See, this is the part I don't follow. What do you mean by full Unicode and how do you envision it flowing? I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough. With Allen's proposal we'll finally have some new APIs for JS developers to use. That doesn't answer my questions Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc. IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This is a big deal! Hope this helps, /be
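To make the splicing and slicing hazard concrete, a sketch of my own (not from the message): cutting a string at a code-unit boundary can strand half of a surrogate pair.

```javascript
// "x" followed by U+1D11E as a surrogate pair: 3 code units, 2 code points.
var s = "x\uD834\uDD1E";

// Slicing by code units cuts through the pair, leaving a lone high surrogate:
var cut = s.slice(0, 2);
console.log(cut.length);                     // 2
console.log(cut.charCodeAt(1).toString(16)); // "d834" -- an unpaired surrogate
```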
Re: Full Unicode strings strawman
On 5/17/11 1:40 PM, Brendan Eich wrote: On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote: On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to have to be able to deal with arbitrary encodings. Having the _capability_ might be nice, but forcing all web authors to think about it seems like a non-starter. Allen said be able to, not forcing. Big difference. I think we three at least are in agreement here. I think we're in agreement on the sentiment, but perhaps not on where on the able to to forcing spectrum this strawman falls. Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, ignoring surrogates). And JS code does generally assume 16 bits are enough. With Allen's proposal we'll finally have some new APIs for JS developers to use. That doesn't answer my questions Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc. IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This is a big deal! OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of something. Whatever that something is. Hope this helps, Halfway. The DOM interaction questions remain unanswered. Seriously, I think we should try to make a list of the issues there, the pitfalls that would arise for web developers as a result, then go through and see how and whether to address them. 
Then we'll have a good basis for considering the web compat impact. -Boris
Re: Full Unicode strings strawman
On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? Where in the strawman is anything of that kind observably (to JS authors) proposed? Ok, full Unicode means non-BMP characters not being wrongly treated as two uint16 units and miscounted, separated or partly deleted by splicing and slicing, etc. IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This is a big deal! OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of something. Whatever that something is. Yes, old APIs for building strings, e.g. String.fromCharCode, still build gunk strings, aka uint16 data hacked into strings. New APIs for characters. This has to apply to internal JS engine / DOM implementation APIs as needed, too. Hope this helps, Halfway. The DOM interaction questions remain unanswered. Seriously, I think we should try to make a list of the issues there, the pitfalls that would arise for web developers as a result, then go through and see how and whether to address them. Then we'll have a good basis for considering the web compat impact. Good idea. /be
Re: Full Unicode strings strawman
On May 17, 2011, at 10:47 AM, Brendan Eich wrote: On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? Where in the strawman is anything of that kind observably (to JS authors) proposed? The flag idea just mooted in this thread is not addressing a new problem -- we can have such mixing bugs today. True, the odds may go up for such bugs in the future (hard to assess whether or how much). At least with new APIs for characters, not gunk-units, we can detect mixtures dynamically. This still seems a good idea but it is not essential (yet), and it is nowhere near forcing developers to worry about encodings. /be
Re: arrow syntax unnecessary and the idea that function is too long
On May 17, 2011, at 4:57, Peter Michaux petermich...@gmail.com wrote: The goal of pleasing compiler writers should be to make it possible to compile existing languages like Perl, Ruby, Python and Scheme to JavaScript. These languages are the ones that people know and really want to use and target their compilers to JavaScript. You sound like you really hate JavaScript and can’t imagine working with it unless some other language is compiled to it. I’ve programmed quite a bit of Perl, Python, and Scheme and found that once you get to know the proverbial “good parts” of JavaScript, it can be quite elegant. That is, I don’t miss any of these three languages, except maybe for Python’s runtime library (and Java’s tools, but that’s a different topic). With the increasing momentum behind JavaScript, IMHO the primary goal should be to improve the language for people who actually want to program in it. This is difficult enough, given all the parties that have to be pleased. Listening to feedback from compiler writers should be a secondary goal. -- Dr. Axel Rauschmayer a...@rauschma.de twitter.com/rauschma home: rauschma.de blog: 2ality.com ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: Full Unicode strings strawman
I would much prefer changing UCS-2 to UTF-16, thus formalizing that surrogate pairs are permitted. It would be very difficult for that to break any existing code, and it would still allow representation of everything reasonable in Unicode. That would enable Unicode, and allow extending string literals and regular expressions for convenience with the U+10 style notation (which would be equivalent to the surrogate pair). The character code manipulation functions could be similarly augmented without breaking anything (and maybe not needing different names?) You might want to qualify the UTF-16 as allowing, but strongly discouraging, lone surrogates for those people who didn't realize their binary data wasn't a string. The sole disadvantage would be that iterating through a string would require consideration of surrogates, same as today. The same caution is also necessary to avoid splitting Ä (U+0041 U+0308) into its component A and ̈ parts. I wouldn't be opposed to some sort of helper functions or classes that aided in walking strings, preferably with options to walk the graphemes (or whatever), not just the surrogate pairs. FWIW: we have such a helper for surrogates in .Net and nobody uses them. The most common feedback is that it's not that helpful because it doesn't deal with the graphemes. - Shawn shawn.ste...@microsoft.com Senior Software Design Engineer Microsoft Windows ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
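Shawn's caveat that iterating through a string requires consideration of surrogates can be illustrated with an ES5-era sketch (eachCodePoint is a hypothetical helper name, not part of the strawman or any proposal):

```javascript
// Iterate a JS string by Unicode code points rather than 16-bit code units.
// A well-formed surrogate pair is combined into one supplementary code point;
// a lone surrogate is passed through as-is, matching today's permissive strings.
function eachCodePoint(str, fn) {
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        fn(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    fn(hi); // BMP code unit (or unpaired surrogate)
  }
}

// "\uD834\uDD1E" is U+1D11E MUSICAL SYMBOL G CLEF.
var points = [];
eachCodePoint("a\uD834\uDD1E", function (cp) { points.push(cp); });
// points is [0x61, 0x1D11E]: two code points from a three-code-unit string.
```

This is exactly the per-call-site bookkeeping that the walking helpers Shawn mentions would encapsulate.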
Re: Full Unicode strings strawman
On 5/17/11 1:47 PM, Brendan Eich wrote: On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? Where in the strawman is anything of that kind observably (to JS authors) proposed? The strawman is silent on the matter. It was proposed by Allen in the discussion about how the strawman interacts with the DOM. -Boris ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
On 17 May 2011 12:36, Boris Zbarsky bzbar...@mit.edu wrote: Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16. Nor with any other Unicode encoding, really. They don't represent, on their own, Unicode characters. Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as something that maps to the set of all Unicode code points. That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all. No, if I understand Allen's proposal correctly, we're allowing storage of some sort of number arrays that may contain reserved code points, some of which cannot be represented in UTF-16. This isn't that different from the status quo; it is possible right now to generate JS Strings which are not valid UTF-16 by creating invalid surrogate pairs. Keep in mind, also, that even a sequence of random bytes is a valid Unicode string. The standard does not require that they be well-formed. (D80) Right, so if it's looking for non-BMP characters in the string, say, instead of computing the length, it won't find them. How the heck is that just works? My untested hypothesis is that the vast majority of JS code looking for non-BMP characters is looking for them in order to call them out for special processing, because the code unit and code point size are different. When they don't need special processing, they don't need to be found. Since the high-surrogate code points do not appear in well-formed Unicode strings, they will not be found, and the unneeded special processing will not happen. This train of clauses forms the basis for my opinion that, for the majority of folks, things will just work. What would that even mean? DOMString is defined to be an ES string in the ES binding right now. 
Is the proposal to have some other kind of object for DOMString (so that, for example, String.prototype would no longer affect the behavior of DOMString the way it does now)? Wait, are DOMStrings formally UTF-16, or are they ES Strings? This might mean that it is possible that JSString=DOMString would throw, as full Unicode Strings could contain code points which are not representable in UTF-16. How is that different from sticking non-UTF-16 into an ES string right now? Currently, JS Strings are effectively arrays of 16-bit code units, which are indistinguishable from 16-bit Unicode strings (D82). This means that a JS application can use JS Strings as arrays of uint16, and expect to be able to round-trip all strings, even those which are not well-formed, through a UTF-16 DOM. If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work. It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally. Uh... what does UTF-8 have to do with this? If you're already storing UTF-8 strings internally, then you are already doing something expensive (like copying) to get their code units into and out of JS; so no incremental perf impact by not having a common UTF-16 backing store. (As a note, Gecko and WebKit both use UTF-16 internally; I would be _really_ surprised if Trident does not. No idea about Presto.) FWIW - last time I scanned the v8 sources, it appeared to use a three-representation class, which could store either ASCII, UCS2, or UTF-8. Presumably ASCII could also be ISO-Latin-1, as both are exact, naive, byte-sized UCS2/UTF-16 subsets. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
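Wes's uint16 round-trip point is observable in any current engine; this sketch only demonstrates today's behavior, not anything from the strawman:

```javascript
// Today a JS string is effectively an array of arbitrary 16-bit code units:
// a lone surrogate, though not valid UTF-16, survives storage and retrieval.
var gunk = String.fromCharCode(0xDC08); // unpaired low surrogate
console.log(gunk.length);               // 1
console.log(gunk.charCodeAt(0) === 0xDC08); // true: round-trips intact

// Concatenation and length are blind to pairing: one supplementary
// character still reads as two units.
var pair = String.fromCharCode(0xD834) + String.fromCharCode(0xDD1E);
console.log(pair.length);               // 2, though it is one code point
```

It is precisely this lossless uint16 round-trip that an array-of-code-points redefinition would have to preserve or knowingly break.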
Re: Full Unicode strings strawman
On 5/17/11 2:12 PM, Wes Garland wrote: That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See How do I convert an unpaired UTF-16 surrogate to UTF-8? at http://unicode.org/faq/utf_bom.html which says: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. (fwiw, this is the third hit on Google for utf-8 surrogates right after the Wikipedia articles on UTF-8 and UTF-16, so it's not like it's hard to find this information). No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all. No, if I understand Allen's proposal correctly, we're allowing storage of some sort of number arrays that may contain reserved code points, some of which cannot be represented in UTF-16. See above. You're allowing number arrays that may or may not be interpretable as Unicode strings, period. This isn't that different from the status quo; it is possible right now to generate JS Strings which are not valid UTF-16 by creating invalid surrogate pairs. True. However right now no one is pretending that strings are anything other than arrays of 16-bit units. Keep in mind, also, that even a sequence of random bytes is a valid Unicode string. The standard does not require that they be well-formed. (D80) Uh... A sequence of _bytes_ is not anything related to Unicode unless you know how it's encoded. Not sure what (D80) is supposed to mean.
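The converter-must-treat-this-as-an-error rule quoted from the FAQ is already observable in ES: encodeURIComponent performs a UTF-16-to-UTF-8 conversion before percent-escaping, and per the ES5 spec it rejects an unpaired surrogate:

```javascript
// An unpaired surrogate is a UTF-16 -> UTF-8 conversion error,
// so encodeURIComponent throws a URIError rather than emit ill-formed UTF-8.
var threw = false;
try {
  encodeURIComponent("\uD800"); // lone high surrogate
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true

// A properly paired surrogate converts fine: U+1D11E is F0 9D 84 9E in UTF-8.
console.log(encodeURIComponent("\uD834\uDD1E")); // "%F0%9D%84%9E"
```

So the language already contains one converter that refuses to treat surrogate code points as encodable, exactly as the FAQ requires.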
Right, so if it's looking for non-BMP characters in the string, say, instead of computing the length, it won't find them. How the heck is that just works? My untested hypothesis is that the vast majority of JS code looking for non-BMP characters is looking for them in order to call them out for special processing, because the code unit and code point size are different. When they don't need special processing, they don't need to be found. This hypothesis is worth testing before being blindly inflicted on the web. What would that even mean? DOMString is defined to be an ES string in the ES binding right now. Is the proposal to have some other kind of object for DOMString (so that, for example, String.prototype would no longer affect the behavior of DOMString the way it does now)? Wait, are DOMStrings formally UTF-16, or are they ES Strings? DOMStrings are formally UTF-16 in the DOM spec. They are defined to be ES strings in the ES binding for the DOM. Please be careful to not confuse the DOM and its language bindings. One could change the ES binding to use a non-ES-string object to preserve the DOM's requirement that strings be sequences of UTF-16 code units. I'd expect this would break the web unless one is really careful doing it... How is that different from sticking non-UTF-16 into an ES string right now? Currently, JS Strings are effectively arrays of 16-bit code units, which are indistinguishable from 16-bit Unicode strings Yes. (D82) ? This means that a JS application can use JS Strings as arrays of uint16, and expect to be able to round-trip all strings, even those which are not well-formed, through a UTF-16 DOM. Yep. And they do. If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work. OK, that seems like a breaking change.
It might mean extra copying, or it might not if the DOM implementation already uses UTF-8 internally. Uh... what does UTF-8 have to do with this? If you're already storing UTF-8 strings internally, then you are already doing something expensive (like copying) to get their code units into and out of JS Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like. (As a note, Gecko and WebKit both use UTF-16 internally; I would be _really_ surprised if Trident does not. No idea about Presto.) FWIW - last I time I scanned the v8 sources, it appeared to use a three-representation class, which could store either ASCII, UCS2, or UTF-8. Presumably ASCII could also be ISO-Latin-1, as both are exact, naive, byte-sized UCS2/UTF-16 subsets. There's a
Re: Full Unicode strings strawman
On 5/17/11 2:24 PM, Allen Wirfs-Brock wrote: In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? This already occurs in JS. For example, the encodeURI function produces a string whose characters are the UTF-8 encoding of a UTF-16 string (including recognition of surrogate pairs). Last I checked, encodeURI output a pure ASCII string. Am I just missing something? The ASCII string happens to be the %-escaping of the UTF-8 representation of the Unicode string you get by assuming that the initial JS string is a UTF-16 representation of said Unicode string. But at no point here is the author dealing with UTF-8. OK, but still allows sticking non-Unicode gunk into the strings, right? So they're still vectors of something. Whatever that something is. Conceptually unsigned 32-bit values. The actual internal representation is likely to be something else. I don't care about the internal representation; I'm interested in the author-observable behavior. Interpretation of those values is left to the functions (both built-in and application) that operate upon them. OK. That includes user-written functions, of course, which currently only have to deal with UTF-16 (and maybe UCS-2 if you want to be very pedantic). Most built-in string methods do not apply any interpretation and will happily process strings as vectors of arbitrary uint32 values. Some built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal with Unicode characters or various Unicode encodings and these have to be explicitly defined to deal with non-Unicode character values or invalid encodes. That seems fine. This is not where problems lie. These functions already are defined for ES5 in this manner WRT the representation of strings as vectors of arbitrary uint16 values. Yes, sure. -Boris ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: Full Unicode strings strawman
Note: The W3C Internationalization Core WG published a set of requirements in this area for consideration by ES some time ago. It lives here: http://www.w3.org/International/wiki/JavaScriptInternationalization The section on 'locale related behavior' is being separately addressed. I think that: 1. Changing references from UCS-2 to UTF-16 makes sense, although the spec, IIRC, already *says* UTF-16. 2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is ill-formed, but there are too many cases in which one might wish to have such broken strings for scripting purposes. 3. We should have escape syntax for supplementary characters (such as \U001). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting. As Shawn notes, basically, there are three ways that one might wish to access strings: - as grapheme clusters (visual units of text) - as Unicode scalar values (logical units of text, i.e. characters) - as code units (encoding units of text) The example I use in the Unicode conference internationalization tutorial is a box on a Web site with an ES controlled message underneath it saying You have 200 characters remaining. I think it is instructive to look at how Java managed this transition. In some cases the 200 represents the number of storage units I have available (as in my backing database), in which case String.length is what I probably want. In some cases I want to know how many Unicode characters there are (Java solves this with the codePointCount(), codePointBefore(), and codePointAt() methods). These are relatively rare operations, but they have occasional utility. Or I may want grapheme clusters (Java attempts to solve this with BreakIterators and I tend to favor doing the same thing in JavaScript---default grapheme clusters are better than nothing, but language-specific grapheme clusters are more useful). 
If we follow the above, providing only minimal additional methods for accessing codepoints when necessary, this also limits the impact of adding supplementary character support to the language. Regex probably works the way one supposes (both \U001 and \ud800\udc00 find the surrogate pair \ud800\udc00 and one can still find the low surrogate \udc00 if one wishes to). And existing scripts will continue to function without alteration. However, new scripts can be written that use supplementary characters. Regards, Addison Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture. -Original Message- From: Shawn Steele [mailto:shawn.ste...@microsoft.com] Sent: Tuesday, May 17, 2011 11:09 AM To: Brendan Eich; Boris Zbarsky Cc: es-discuss Subject: RE: Full Unicode strings strawman I would much prefer changing UCS-2 to UTF-16, thus formalizing that surrogate pairs are permitted. It would be very difficult for that to break any existing code, and it would still allow representation of everything reasonable in Unicode. That would enable Unicode, and allow extending string literals and regular expressions for convenience with the U+10 style notation (which would be equivalent to the surrogate pair). The character code manipulation functions could be similarly augmented without breaking anything (and maybe not needing different names?) You might want to qualify the UTF-16 as allowing, but strongly discouraging, lone surrogates for those people who didn't realize their binary data wasn't a string. The sole disadvantage would be that iterating through a string would require consideration of surrogates, same as today. The same caution is also necessary to avoid splitting Ä (U+0041 U+0308) into its component A and ̈ parts. I wouldn't be opposed to some sort of helper functions or classes that aided in walking strings, preferably with options to walk the graphemes (or whatever), not just the surrogate pairs.
FWIW: we have such a helper for surrogates in .Net and nobody uses them. The most common feedback is that it's not that helpful because it doesn't deal with the graphemes. - Shawn shawn.ste...@microsoft.com Senior Software Design Engineer Microsoft Windows ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
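Java's codePointCount, which Addison mentions, has a straightforward ES5-era analog; this sketch (the helper name is hypothetical) shows how the "You have 200 characters remaining" answer differs from String.length:

```javascript
// Count Unicode code points in a JS string, treating each well-formed
// surrogate pair as one. Mirrors Java's String.codePointCount in spirit;
// lone surrogates each count once, matching today's permissive strings.
function codePointCount(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
      var d = str.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) i++; // the pair counts once
    }
    count++;
  }
  return count;
}

var s = "ab\uD834\uDD1E"; // 'a', 'b', U+1D11E
console.log(s.length);          // 4 code units (what a UTF-16 database stores)
console.log(codePointCount(s)); // 3 code points (what a user might call characters)
```

Note that neither number is the grapheme cluster count; as Addison says, that needs BreakIterator-style segmentation, not surrogate arithmetic.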
RE: Full Unicode strings strawman
Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as something that maps to the set of all Unicode code points. That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. No, you're allowing storage of some sort of number arrays that don't represent Unicode strings at all. Codepoints != encoding. High and Low surrogates are legal code points, but are only legitimate code points in UTF-16 if they occur in a pair. If they aren’t in a proper pair, they’re illegal. They are always illegal in UTF-32 and UTF-8. There are other code points that shouldn’t be used for interchange in Unicode too: U+xx/U+xxFFFE for example. It’s orthogonal to the other question, but the documentation should clearly suggest that users don’t pretend binary data is character data when it’s not. That leads to all sorts of crazy stuff, like illegal lone surrogates trying to be illegally encoded in UTF-8. -Shawn ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
On May 17, 2011, at 12:00 PM, Phillips, Addison wrote: Note: The W3C Internationalization Core WG published a set of requirements in this area for consideration by ES some time ago. It lives here: http://www.w3.org/International/wiki/JavaScriptInternationalization You might want to formally convey these requests to TC39 via the W3C/Ecma liaison process. That would carry much more weight and visibility. I don't believe this document has shown up on any TC39 agenda or has been incorporated into any of our planning. Allen ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: arrow syntax unnecessary and the idea that function is too long
On Tue, May 17, 2011 at 10:50 AM, Axel Rauschmayer a...@rauschma.de wrote: On May 17, 2011, at 4:57, Peter Michaux petermich...@gmail.com wrote: The goal of pleasing compiler writers should be to make it possible to compile existing languages like Perl, Ruby, Python and Scheme to JavaScript. These languages are the ones that people know and really want to use and target their compilers to JavaScript. You sound like you really hate JavaScript and can’t imagine working with it unless some other language is compiled to it. Actually the opposite is true. I write in JavaScript all day and like it a lot. I wouldn't want to compile to JavaScript with today's possibilities. What I was trying to express is that I believe the dream of people who want to compile to JavaScript is to write in their server-side language of choice (e.g. Perl, Python, Ruby, Scheme, Java, etc.) and compile that to JavaScript. Peter ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: Full Unicode strings strawman
We did. Cf. http://lists.w3.org/Archives/Public/public-i18n-core/2009OctDec/0102.html Addison Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture. -Original Message- From: Allen Wirfs-Brock [mailto:al...@wirfs-brock.com] Sent: Tuesday, May 17, 2011 12:16 PM To: Phillips, Addison Cc: Shawn Steele; Brendan Eich; Boris Zbarsky; es-discuss Subject: Re: Full Unicode strings strawman On May 17, 2011, at 12:00 PM, Phillips, Addison wrote: Note: The W3C Internationalization Core WG published a set of requirements in this area for consideration by ES some time ago. It lives here: http://www.w3.org/International/wiki/JavaScriptInternationalization You might want to formally convey these requests to TC39 via the W3C/Ecma liaison process. That would carry much more weight and visibility. I don't believe this document has shown up on any TC39 agenda or has been incorporated into any of our planning. Allen ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
On 17 May 2011 14:39, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 2:12 PM, Wes Garland wrote: That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See How do I convert an unpaired UTF-16 surrogate to UTF-8? at http://unicode.org/faq/utf_bom.html which says: You are comparing apples and oranges. Which happen to look a lot alike. So maybe apples and nectarines. But the point remains, the FAQ entry you quote talks about encoding a lone surrogate, i.e. a code unit, which is not a complete code point. You can only convert complete code points from one encoding to another. Just like you can't represent part of a UTF-8 code sub-sequence in any other encoding. The fact that code point X is not representable in UTF-16 has no bearing on its status as a code point, nor its convertability to UTF-8. The problem is that UTF-16 cannot represent all possible code points. See above. You're allowing number arrays that may or may not be interpretable as Unicode strings, period. No, I'm not. Any sequence of Unicode code points is a valid Unicode string. It does not matter whether any of those code points are reserved, nor does it matter if it can be represented in all encodings. From page 90 of the Unicode 6.0 specification, in the Conformance chapter: *D80 Unicode string:* A code unit sequence containing code units of a particular Unicode encoding form. • In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other. • A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string. Not sure what (D80) is supposed to mean. 
Sorry, (D80) means per definition D80 of The Unicode Standard, Version 6.0 This hypothesis is worth testing before being blindly inflicted on the web. I don't think anybody in this discussion is talking about blindly inflicting anything on the web. I *do* think this proposal is a good one, and certainly a better way forward than insisting that every JS developer, everywhere, understand and implement (over and over again) the details of encoding Unicode as UTF-16. Allen's point about URI escaping is right on target here. If we redefine JS Strings to be arrays of Unicode code points, then the JS application can use JS Strings as arrays of uint21 -- but round-tripping the high-surrogate code points through a UTF-16 layer would not work. OK, that seems like a breaking change. Yes, I believe it would be, certainly if done naively, but I am hopeful somebody can figure out how to overcome this. Hopeful because I think that fixing the JS Unicode problem is a really big deal. What happens if the guy types a non-BMP character? is a question which should not have to be answered over and over again in every code review. And I still maintain that 99.99% of JS developers never give it first, let alone second, thought. Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like. I understand by this that in the Moz proposals, you mean that the invalid UTF-8 sequences are actually valid UTF-8 Strings which encode code points in the range 0xd800-0xdfff, and that these code points were translated directly (and purposefully incorrectly) as UTF-16 code units when viewed as 16-bit arrays.
If JS Strings were arrays of Unicode code points, this conversion would be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place. The only problem is if there is an intermediate component somewhere that insists on using UTF-16; at that point we just can't represent code point 0xdc08 at all. But that code point will never appear in text; it will only appear for users using the String to store arbitrary data, and their need has already been met. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
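The round-trip gap Wes describes is just the inverse of the surrogate-pair arithmetic; this sketch (toUTF16 is a hypothetical name) makes the failure mode concrete:

```javascript
// Convert one code point to its UTF-16 code unit sequence.
// Code points U+D800..U+DFFF have no UTF-16 representation by definition,
// which is exactly the round-trip gap discussed above.
function toUTF16(cp) {
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    throw new RangeError("surrogate-range code point has no UTF-16 form");
  }
  if (cp <= 0xFFFF) return [cp];       // BMP: one code unit
  cp -= 0x10000;                       // supplementary: split into a pair
  return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)];
}

console.log(toUTF16(0x1D11E)); // [0xD834, 0xDD1E]
// toUTF16(0xDC08) throws: that code point cannot survive a UTF-16 layer,
// so an array-of-code-points string holding it cannot round-trip through one.
```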
Re: Full Unicode strings strawman
On 17 May 2011 15:00, Phillips, Addison addi...@lab126.com wrote: 2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is ill-formed, but there are too many cases in which one might wish to have such broken strings for scripting purposes. 3. We should have escape syntax for supplementary characters (such as \U001). Looking up the surrogate pair for a given Unicode character is extremely inconvenient and is not self-documenting. ... As Shawn notes, basically, there are three ways that one might wish to access strings: ... - as code units (encoding units of text) I don't understand why (except that it is there by an accident of history) that it is desirable to expose a particular low-level detail about one possible encoding for Unicode characters to end-user programmers. Your point about database storage only holds if the database happens to store Unicode strings encoded in UTF-16. It could just as easily use UTF-8, UTF-7, or UTF-32. For that matter, the database input routine could filter all characters not in ISO-Latin-1 and store only the lower half of non-surrogate-pair UTF-16 code units. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
On 5/17/11 3:29 PM, Wes Garland wrote: But the point remains, the FAQ entry you quote talks about encoding a lone surrogate, i.e. a code unit, which is not a complete code point. You can only convert complete code points from one encoding to another. Just like you can't represent part of a UTF-8 code sub-sequence in any other encoding. The fact that code point X is not representable in UTF-16 has no bearing on its status as a code point, nor its convertability to UTF-8. The problem is that UTF-16 cannot represent all possible code points. My point is that neither can UTF-8. Can you name an encoding that _can_ represent the surrogate-range codepoints? From page 90 of the Unicode 6.0 specification, in the Conformance chapter: /D80 Unicode string:/ A code unit sequence containing code units of a particular Unicode encoding form. • In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other. • A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string. Not sure what (D80) is supposed to mean. Sorry, (D80) means per definition D80 of The Unicode Standard, Version 6.0 Ah, ok. So the problem there is that this definition only makes sense when a particular Unicode encoding form has been chosen. Which Unicode encoding form have we chosen here? But note also that D76 in that same document says: Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points. and D79 says: A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. and To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences.
Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values. In particular, this makes it clear (to me, at least) that whatever Unicode encoding form you choose, a Unicode string can only consist of code units encoding Unicode scalar values, which does NOT include high and low surrogates. Therefore I stand by my statement: if you allow what to me looks like arrays of UTF-32 code units and also values that fall into the surrogate ranges then you don't get Unicode strings. You get a set of arrays that contains Unicode strings as a proper subset. OK, that seems like a breaking change. Yes, I believe it would be, certainly if done naively, but I am hopeful somebody can figure out how to overcome this. As long as we worry about that _before_ enshrining the result in a spec, I'm all for being hopeful. Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like. I understand by this that in the Moz proposals, you mean that the invalid UTF-8 sequences are actually valid UTF-8 Strings which encode code points in the range 0xd800-0xdfff There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings. and that these code points were translated directly (and purposefully incorrectly) as UTF-16 code units when viewed as 16-bit arrays. Yep. If JS Strings were arrays of Unicode code points, this conversion would be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place. Sorry, no. See above.
The only problem is if there is an intermediate component somewhere that insists on using UTF-16... at that point we just can't represent code point 0xdc08 at all. I just don't get it. You can stick the invalid 16-bit value 0xdc08 into a UTF-16 string just as easily as you can stick the invalid 24-bit sequence 0xed 0xb0 0x88 into a UTF-8 string. Can you please, please tell me what made you decide there's _any_ difference between the two cases? They're equally invalid in _exactly_ the same way. -Boris ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
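Boris's point is directly observable in any ES5-era engine: JS string elements are unchecked 16-bit code units, so an unpaired surrogate is just as storable as any other value. A minimal sketch (the variable names are illustrative):

```javascript
// JS strings are sequences of 16-bit code units, not checked for
// UTF-16 well-formedness, so a lone surrogate is representable.
var lone = "a" + String.fromCharCode(0xDC08) + "b";

console.log(lone.length);        // 3 code units
console.log(lone.charCodeAt(1)); // 56328 (0xDC08), an unpaired low surrogate

// A well-formed surrogate pair, by contrast, takes two code units
// and decodes to one non-BMP code point (here U+10000):
var pair = String.fromCharCode(0xD800, 0xDC00);
console.log(pair.length);        // 2
```

Nothing in the language objects to `lone`; it only becomes a problem when the string is handed to something that insists on interpreting it as well-formed UTF-16.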
Re: Full Unicode strings strawman
The wrong conclusion is being drawn. I can say definitively that for the string a\uD800b: - It is a valid Unicode string, according to the Unicode Standard. - It cannot be encoded as well-formed in any UTF-x (it is not 'well-formed' in any UTF). - When it comes to conversion, the bad code unit \uD800 needs to be handled (e.g. converted to U+FFFD, escaped, etc.) Any programming language using Unicode has the choice of either 1. allowing strings to be general Unicode strings, or 2. guaranteeing that they are always well-formed. There are trade-offs either way, but both are feasible. Mark *— Il meglio è l’inimico del bene —* On Tue, May 17, 2011 at 13:03, Boris Zbarsky bzbar...@mit.edu wrote: [...]
Re: Full Unicode strings strawman
On 17 May 2011 16:03, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 3:29 PM, Wes Garland wrote: The problem is that UTF-16 cannot represent all possible code points. My point is that neither can UTF-8. Can you name an encoding that _can_ represent the surrogate-range code points? UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so it's not really worth discussing. UTF-16 is the odd one out. Therefore I stand by my statement: if you allow what to me looks like arrays of UTF-32 code units and also values that fall into the surrogate ranges then you don't get Unicode strings. You get a set of arrays that contains Unicode strings as a proper subset. Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct. There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings. Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are not *well-formed* strings, but they are Unicode 8-bit Strings (D81) nonetheless. What you can't do is encode 16-bit code units in UTF-8 Strings. This is because you can only convert from one encoding to another via code points. Code units have no cross-encoding meaning. Further, you can't encode code points d800 - dfff in UTF-16 Strings, leaving you at a loss when you want to store those values in JS Strings (i.e. when using them as uint16[]) except to generate ill-formed UTF-16. I believe it would be far better to treat those values as Unicode code points, not 16-bit code units, and to allow JS String elements to be able to express the whole 21-bit code point range afforded by Unicode. In other words, current mis-use of JS Strings which can store characters 0-0xFFFF in ill-formed UTF-16 strings would become use of JS Strings to store code points 0-0x10FFFF which may use the reserved code points d800-dfff, the surrogate range, which cannot be represented in UTF-16.
But CAN be represented, without loss, in UTF-8, UTF-32, and proposed-new-JS-Strings. If JS Strings were arrays of Unicode code points, this conversion would be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08, with no incorrect conversion taking place. Sorry, no. See above.

# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003

I just don't get it. You can stick the invalid 16-bit value 0xdc08 into a UTF-16 string just as easily as you can stick the invalid 24-bit sequence 0xed 0xb0 0x88 into a UTF-8 string. Can you please, please tell me what made you decide there's _any_ difference between the two cases? They're equally invalid in _exactly_ the same way. The difference is that in UTF-8, 0xed 0xb0 0x88 means The Unicode code point 0xdc08, and in UTF-16 0xdc08 means Part of some non-BMP code point. Said another way, 0xed in UTF-8 has nearly the same meaning as 0xdc08 in UTF-16. Both are ill-formed code unit subsequences which do not represent a code unit (D84a). Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
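The byte sequence in Wes's iconv example is just the ordinary three-byte UTF-8 bit layout applied to U+DC08 with the surrogate check omitted. A sketch of such a non-conforming, "generalized UTF-8" encoder for the three-byte range (the function name is invented for illustration; a conforming UTF-8 encoder must instead reject surrogates, as Boris notes below):

```javascript
// Encode a code point in the three-byte range (0x0800-0xFFFF) using the
// raw UTF-8 bit layout, WITHOUT the check that rejects surrogates
// D800-DFFF. This reproduces the byte sequence from the iconv example;
// it is deliberately NOT conforming UTF-8.
function rawUtf8Bytes3(cp) {
  return [
    0xE0 | (cp >> 12),         // 1110xxxx
    0x80 | ((cp >> 6) & 0x3F), // 10xxxxxx
    0x80 | (cp & 0x3F)         // 10xxxxxx
  ];
}

var bytes = rawUtf8Bytes3(0xDC08);
console.log(bytes.map(function (b) { return b.toString(16); }).join(" "));
// ed b0 88
```

This makes the symmetry of Boris's point concrete: the bit patterns exist in both encoding forms; it is the well-formedness rules, not the representational capacity, that exclude surrogates.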
Re: Full Unicode strings strawman
On 5/17/11 5:24 PM, Wes Garland wrote: UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so it's not really worth discussing. UTF-16 is the odd one out. That's not what the spec says. Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct. Sorry, but no... how much more clear can the spec get? There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings. Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are not /well-formed/ strings, but they are Unicode 8-bit Strings (D81) nonetheless. The spec seems to pretty clearly define UTF-8 strings as things that do NOT contain the encoding of those code points. If you think otherwise, cite please. Further, you can't encode code points d800 - dfff in UTF-16 Strings, Where does the spec say this? And why does that part of the spec not apply to UTF-8?

# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003

As far as I can tell, that second conversion is just an implementation bug per the spec. See the part I quoted which explicitly says that an encoder in that situation must stop and return an error. The difference is that in UTF-8, 0xed 0xb0 0x88 means The Unicode code point 0xdc08 According to the spec you were citing, that code unit sequence means a UTF-8 decoder should error, no? -Boris ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: arrow syntax unnecessary and the idea that function is too long
On May 17, 2011, at 5:04 PM, Kyle Simpson wrote: Regarding the -> and => syntax, I just want to throw out one other concern that I hope is taken into account, not only now, but for the future: I really hope that we don't get to the point where we start adding functionality to that style of function that is not available to explicit functions (we're almost, but not, there with having => do the magical `this` binding). You have to distinguish syntax from semantics. There's nothing proposed for arrow functions that is more than shorter syntax -- including |this| binding. I know Brendan and others have declared it's shorthand only, but it can be a slippery slope, and to rely on the if you don't like -> don't use it argument, we have to make sure that it really stays only a shorthand and nothing more, otherwise it's tail-wagging-the-dog. Agreed, which is why I'm still going to write up a fairly radical Ruby-block proposal that competes in this sense: it too gives better function syntax, plus semantics not available to arrow functions as constrained to be just syntax. I hope we'll be able to decide between these two approaches quickly, since I do not want to do both. That is, either arrow functions win and shorter syntax is enough; or we have blocks for control abstraction (which means other syntax changes, details soon) and no arrow functions. In other words, I hope that those who favor -> aren't also hoping that eventually -> replaces `function` entirely. As stated many times thus far in this thread, there are still those of us who favor (and maybe always will) the explicitness of `function(){}` or `#(){}`. There's no way to remove 'function' long syntax from JS. Just no way. Your point about just syntax is well-taken, since translation tools will be important in aiding Harmony migration -- not just for targeting downrev browsers but for added static checking -- and these should be as simple (local rewriting, e.g. transpilers not compilers) as possible.
/be ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
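The "just shorter syntax" claim about arrow |this| binding means any arrow body could be expanded today into the familiar captured-this pattern. A sketch of the manual equivalent (the Counter example is invented for illustration):

```javascript
// What the proposed arrow form `this.inc = () => { this.count++; };`
// would abbreviate: capturing |this| lexically by hand.
function Counter() {
  this.count = 0;
  var self = this;                          // manual lexical |this|
  this.inc = function () { self.count++; };
}

var c = new Counter();
c.inc();
console.log(c.count); // 1
```

Since the rewrite is purely local, this is exactly the kind of transformation the "transpilers not compilers" remark above has in mind.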
Re: Full Unicode strings strawman
On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 5:24 PM, Wes Garland wrote: Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct. Sorry, but no... how much more clear can the spec get? In the past, I have read it thus, pseudo-BNF:

UnicodeString = CodeUnitSequence // D80
CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit = anything in the current encoding form // D77

Upon careful re-reading of this part of the specification, I see that D79 is also important. It says that A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence., and further clarifies that The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one. This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct. Which is unfortunate, as it means that we either
1. Allow non-Unicode strings in JS -- i.e. Strings composed of all values in the set [0x0, 0x10FFFF]
2. Keep making programmers pay the raw-UTF-16 representation tax
3. Break the String-as-uint16 pattern
I still believe that #1 is the way forward, and that the problem of round-tripping these values through the DOM is solvable. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
That is incorrect. See below. Mark *— Il meglio è l’inimico del bene —* On Tue, May 17, 2011 at 18:33, Wes Garland w...@page.ca wrote: On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 5:24 PM, Wes Garland wrote: Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct. Sorry, but no... how much more clear can the spec get? In the past, I have read it thus, pseudo-BNF: UnicodeString = CodeUnitSequence // D80 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78 CodeUnit = anything in the current encoding form // D77 So far, so good. In particular, d800 is a code unit for UTF-16, since it is a code unit that can occur in some code unit sequence in UTF-16. Upon careful re-reading of this part of the specification, I see that D79 is also important. It says that A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence., True. and further clarifies that The mapping of the set of Unicode scalar values to the set of code unit sequences for a Unicode encoding form is one-to-one. True. This is all consistent with saying that UTF-16 can't contain an isolated d800. *However, that only shows that a Unicode 16-bit string (D82) is not the same as a UTF-16 String (D89), which has been pointed out previously.* Repeating the note under D89: A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be *in UTF-16*. Such a Unicode string is referred to as a *valid UTF-16 string*, or a *UTF-16 string* for short. That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice versa. Examples: - \u0061\ud800\udc00 is both a Unicode 16-bit string and a UTF-16 string. - \u0061\ud800 is a Unicode 16-bit string, but not a UTF-16 string. This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct. That is incorrect.
[...] ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
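Mark's distinction is mechanical: a Unicode 16-bit string is any sequence of 16-bit code units, while a UTF-16 string (D89) additionally requires every surrogate to be properly paired. A sketch of that well-formedness check over a JS string (the function name is illustrative):

```javascript
// Returns true iff every high surrogate (D800-DBFF) is immediately
// followed by a low surrogate (DC00-DFFF) and no low surrogate appears
// unpaired -- i.e. the string is "in UTF-16" per D89.
function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var cu = s.charCodeAt(i);
    if (cu >= 0xD800 && cu <= 0xDBFF) {        // high surrogate
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++;                                     // skip the paired low surrogate
    } else if (cu >= 0xDC00 && cu <= 0xDFFF) { // unpaired low surrogate
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("\u0061\uD800\uDC00")); // true
console.log(isWellFormedUTF16("\u0061\uD800"));       // false
```

Every JS string passes the "Unicode 16-bit string" test trivially; only strings for which this function returns true are also UTF-16 strings.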
Private Names in 'text/javascript'
The Private Names strawman currently combines a new runtime capability (using both strings and private names as keys in objects) with several new syntactic constructs (private binding declarations, #.id). At the March meeting, I recall there was some support for the idea of separating these two aspects, and exposing the runtime capability also as a library that could be used in 'text/javascript'. I added a comment to the Private Names strawman page to suggest how this could be done. The runtime behavior of the proposal is the same, but in addition, a library function Object.createPrivateName(name) is added which provides direct access to the internal CreatePrivateName abstract operation. This allows the use of private names in a more verbose form, but without needing new syntax - similar in spirit to the ES5 Object.* operations. Borrowing an example from the current proposal to illustrate:

Using 'text/harmony' syntax:

function Point(x, y) {
  private x, y;
  this.x = x;
  this.y = y;
  // ... methods that use private x and y properties
}
var pt = new Point(1, 2);

Using 'text/javascript' syntax:

function Point(x, y) {
  var _x = Object.createPrivateName("x");
  var _y = Object.createPrivateName("y");
  this[_x] = x;
  this[_y] = y;
  // ... methods that use private _x and _y properties
}
var pt = new Point(1, 2);

There seem to be several benefits to this: (1) The private name capability can be made available to 'text/javascript' (2) The feature is easily feature-detectable, with a fallback of using '_'-prefixed or similar pseudo-private conventions (3) The core functionality can potentially be agreed upon and implemented in engines earlier than full new syntax Luke ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
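Benefit (2), feature detection with a pseudo-private fallback, might look like the following sketch. Object.createPrivateName is the strawman's proposed (not yet existing) API; everything else here is invented for illustration:

```javascript
// Use the proposed API when the engine provides it; otherwise fall back
// to a pseudo-private "_"-prefixed string key, as the post suggests.
var makeKey = typeof Object.createPrivateName === "function"
  ? Object.createPrivateName
  : function (name) { return "_" + name; };

function Point(x, y) {
  var kx = makeKey("x"), ky = makeKey("y");
  this[kx] = x;
  this[ky] = y;
  this.getX = function () { return this[kx]; };
}

var pt = new Point(1, 2);
console.log(pt.getX()); // 1
```

In an engine without the API the keys are merely convention-private strings, but the calling code is identical either way, which is the point of exposing the capability as a library.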
Re: Private Names in 'text/javascript'
Yes, I agree that separating them out is a good idea. Allen and I have been working on this lately, and I've signed up to present private names at the upcoming face-to-face. Our thinking has been along similar lines to what you describe here. Dave On May 17, 2011, at 6:55 PM, Luke Hoban wrote: [...] ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: Private Names in 'text/javascript'
Yes, I agree that separating them out is a good idea. Allen and I have been working on this lately, and I've signed up to present private names at the upcoming face-to-face. Our thinking has been along similar lines to what you describe here. Dave Great - I see the new unique_string_values strawman now. That looks like it does address the same goal. Happy to see there is already progress on this. Was there a particular reason for the shift to treating these names as a new kind of string value instead of as a separate object kind which could be used as a key in objects? Luke ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Private Names in 'text/javascript'
Yes (from my perspective) but it is something we are still hashing out so don't assume that will be the final proposal. Allen On May 17, 2011, at 7:33 PM, Luke Hoban wrote: [...] Was there a particular reason for the shift to treating these names as a new kind of string value instead of as a separate object kind which could be used as a key in objects? Luke ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
prototype for operator proposal for review
We had so much fun with feedback on my Unicode proposal I just have to open another one up for list feedback: An updated version of the prototype for (formerly proto) operator proposal is at http://wiki.ecmascript.org/doku.php?id=strawman:proto_operator Allen ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode strings strawman
Mark; Are you Dr. Mark E. Davis (born September 13, 1952 (age 58)), co-founder of the Unicode project (http://en.wikipedia.org/wiki/Unicode) and the president of the Unicode Consortium (http://en.wikipedia.org/wiki/Unicode_Consortium) since its incorporation in 1991? (If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5, et al... those gave me lots of hair loss in the late 90s) On 17 May 2011 21:55, Mark Davis ☕ m...@macchiato.com wrote: In the past, I have read it thus, pseudo-BNF: UnicodeString = CodeUnitSequence // D80 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78 CodeUnit = anything in the current encoding form // D77 So far, so good. In particular, d800 is a code unit for UTF-16, since it is a code unit that can occur in some code unit sequence in UTF-16. *head smack* - code unit, not code point. This means that your original assertion -- that Unicode strings cannot contain the high surrogate code points, regardless of meaning -- is in fact correct. That is incorrect. Aie, Karumba! If we have - a sequence of code points - taking on values between 0 and 0x10FFFF - including high surrogates and other reserved values - independent of encoding ...what exactly are we talking about? Can it be represented in UTF-16 without round-trip loss when normalization is not performed, for the code points 0 through 0x10FFFF? Incidentally, I think this discussion underscores nicely why I think we should work hard to figure out a way to hide UTF-16 encoding details from user-end programmers. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: I noted some open issues on Classes with Trait Composition
On Sun, May 15, 2011 at 10:01 PM, Brendan Eich bren...@mozilla.com wrote: http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#open_issues That wiki page has now had extensive revisions in light of recent discussions with Brendan, Allen, Dave Herman, and Bob Nystrom. It derives from previous discussions with Allen, Bob, and Peter Hallam. I have tried to capture here as best as I could the consensus that has emerged from these discussions. All this derives from earlier discussions that also included Waldemar, Alex Russell, Arv, and Tom Van Cutsem. And the experience of the Traceur project made a significant contribution. This has all had a long history so if I've left out some key contributors, please let me know, thanks. This looks pretty good at a glance, but it's a *lot*, and it's new. It's much less now! The main effect of all the recent feedback was to find opportunities to remove things. What remains is mostly just a way to express the familiar pattern by which JavaScript programmers manually express class-like semantics using prototypes. The result interoperates in both directions with such old code: a class can inherit from a traditional constructor function and vice versa. I have to say this reminds me of ES4 classes. That's neither bad nor good, but it's not just superficial, as far as I can tell (and I was reading specs then and now). It definitely had an influence. There were many things I liked about ES4 classes. On the other hand, I'm in no rush to standardize something this complex and yet newly strawman-spec'ed and yet unimplemented. So we may as well take our time, learn from history, and go around the karmic wheel again for another few years... I'm not against classes as a near-term objective, but in order to *be* near-term and not to unwind in committee, I believe they have to be dead simple and prototypal, with very few knobs, bells and whistles. I am indeed proposing this as a near term objective.
The usual caveats apply: we are asking the committee to approve the general shape presented by this strawman, with syntactic and semantic refinements expected to continue, for this as for all other proposals, after May. Brendan, with all the simplifications since you posted this email, in your opinion, have we achieved the level of simplicity needed? Factoring out privacy Done. and leaving constructor in charge of per-instance property setting, as it is in ES5, Done. would IMHO help. Hope so ;). I do understand that this page may be hard to appreciate without motivation and examples. I'm hoping these are coming soon. -- Cheers, --MarkM ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: I noted some open issues on Classes with Trait Composition
On Sun, May 15, 2011 at 11:49 PM, Brendan Eich bren...@mozilla.com wrote: On May 15, 2011, at 10:01 PM, Brendan Eich wrote: http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#open_issues This looks pretty good at a glance, but it's a *lot*, and it's new. Looking closer, I have to say something non-nit-picky that looks bad and smells like committee: http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#inheritance Two kinds of inheritance, depending on the dynamic type of the result of evaluating the //MemberExpression// on the right of ''extends''? That will be confusing. This smell is actually just my fault; it did not derive from ideas arrived at in meetings. In any case, it is gone. super(x, y); is now always simply equivalent to Superclass.call(this, x, y);, but as if using the original rather than the current binding of Function.prototype.call. Is the traits-composition way really needed in this proposal? If so, then please consider not abusing ''extends'' to mean ''compose'' depending on the dynamic type of the result of the expression to its right. All dependencies on traits have been separated into a separate strawman, extending this one, but not to be proposed until after ES-next. The only inheritance in this one is traditional JS prototypal inheritance. -- Cheers, --MarkM ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
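The "traditional JS prototypal inheritance" the simplified strawman now expresses is the familiar manual pattern; a sketch of what super(x, y) being equivalent to Superclass.call(this, x, y) amounts to (the class names are invented for illustration):

```javascript
function Point(x, y) {
  this.x = x;
  this.y = y;
}

function Point3D(x, y, z) {
  Point.call(this, x, y);   // what super(x, y) is specified to mean
  this.z = z;
}
// Traditional prototype-chain setup that class syntax would abbreviate:
Point3D.prototype = Object.create(Point.prototype);
Point3D.prototype.constructor = Point3D;

var p = new Point3D(1, 2, 3);
console.log(p.x, p.y, p.z);       // 1 2 3
console.log(p instanceof Point);  // true
```

This is also why the interoperability claim above holds in both directions: there is nothing on the chain that a traditional constructor function could not have put there itself.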
Re: I noted some open issues on Classes with Trait Composition
On Mon, May 16, 2011 at 4:54 AM, Dmitry A. Soshnikov dmitry.soshni...@gmail.com wrote: [...] Some simple examples of all use-cases are needed, I think. Absolutely agree. I hope they are coming soon. Watch this space ;). Regarding `new` keyword for the constructor (aka initializer), after all, it also may be OK. E.g. Ruby uses `new` as exactly the method of a class -- Array.new, Object.new, etc. Though, `constructor` is also good, yeah. The history here is interesting. An earlier unreleased version of the Traceur compiler used constructor. When we saw Allen's use of new in one of the object-literal-based class proposals, it seemed like a good idea so we switched to that. In light of Brendan's criticism, we realized we should return to constructor -- it's an elegant pun. Regarding two inheritance types, I think better to make nevertheless one inheritance type -- linear (by prototype chain). Done. And to make additionally small reusable code units -- mixins or traits -- no matter. Thus, of course if they will also be delegation-based and not just copy-own-properties, then we automatically get a sort of multiple inheritance. Gone. Or rather, postponed into a strawman that will not be proposed till after ES-next. [...] -- Cheers, --MarkM ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: prototype for operator proposal for review
If there were a more usable library variant of Object.create instead, it seems the new syntax here would not be as necessary. Instead of:

var o = myProto <| {
  a: 0,
  b: function () {}
}

You could do:

var o = Object.make(myProto, {
  a: 0,
  b: function () {}
})

A few more characters, but still addresses the major issue preventing wider Object.create usage (the use of property descriptors). A library solution also keeps the benefit of not needing new syntax, and being available to text/javascript. As noted in the strawman, similar functions on Array and Function could support the other scenarios described in the proposal. It seems the syntax is perhaps aiming to avoid needing to allocate an intermediate object - but I imagine engines could potentially do that for Object.make and friends as well if it was important for performance? Luke From: es-discuss-boun...@mozilla.org [mailto:es-discuss-boun...@mozilla.org] On Behalf Of Allen Wirfs-Brock Sent: Tuesday, May 17, 2011 7:50 PM To: es-discuss@mozilla.org Subject: prototype for operator proposal for review We had so much fun with feedback on my Unicode proposal I just have to open another one up for list feedback: An updated version of the prototype for (formerly proto) operator proposal is at http://wiki.ecmascript.org/doku.php?id=strawman:proto_operator Allen ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
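Object.make is hypothetical, but Luke's description pins down its behavior relative to Object.create: the second argument supplies plain values rather than property descriptors. A self-hosted sketch under that reading (the name `make` and the sample objects are illustrative):

```javascript
// A sketch of the suggested Object.make: like Object.create, but the
// second argument's own enumerable properties are copied as ordinary
// values instead of being interpreted as property descriptors.
function make(proto, props) {
  var o = Object.create(proto);
  for (var k in props) {
    if (Object.prototype.hasOwnProperty.call(props, k)) {
      o[k] = props[k];
    }
  }
  return o;
}

var myProto = { greet: function () { return "hi"; } };
var o = make(myProto, { a: 0, b: function () {} });
console.log(o.a);       // 0
console.log(o.greet()); // "hi" (inherited from myProto)
```

Note one semantic difference from a raw descriptor map: properties copied this way are always writable, enumerable, and configurable, which is usually what the literal-style call sites shown above intend.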
Re: prototype for operator proposal for review
On 05/17/2011 09:49 PM, Luke Hoban wrote: It seems the syntax is perhaps aiming to avoid needing to allocate an intermediate object – but I imagine engines could potentially do that for Object.make and friends as well if it was important for performance? It's probably possible to do that. But such hacks are rather fragile. I suspect this would take roughly the form of the way SpiderMonkey optimizes Function.prototype.apply, which is roughly to look for calls of properties named apply and do special-case behavior with a PIC in the case that that property is actually |Function.prototype.apply|. It takes some pretty gnarly code, duplicated two places (possibly a third, but that might not be necessary), to make it all happen. That sort of pattern certainly can be repeated if push comes to shove. But I believe doing so is far inferior to dedicated, first-class syntactical support to make the semantics absolutely unambiguous and un-confusable with anything else. In this particular case, I suspect implementing a PIC that way would be even gnarlier, because it wouldn't just be a PIC on the identity of the |Object.make| property, it'd have to also apply to computation of the arguments provided in the function call (or a not-call if you're using a PIC this way). That too can probably be done. But it'd be pretty tricky (thinking of things like the PIC only being applicable if the argument is an object literal, and of it being mostly inapplicable if it's anything else). And if you wanted to extend that to apply to more functions than just a single Object.make function, the hacks will be even more complex, possibly not even by a constant increment. And of course this would also make it harder for IDEs and such to give good first-class syntax highlighting here, because the syntax for this would be ambiguous with user-created stuff. Anyway, food for thought. 
And I know others here are more familiar with this than I am, so please chime in with more if you have it, or corrections if you have them. Jeff ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss