Sugar for *.prototype and for calling methods as functions
Sorry if this has already been brought up (I searched and didn't find anything close to it). In my last months of work with JavaScript, the two things I miss most in ES5 syntax are:

1. A syntax shortcut for '.prototype'. Instead of writing String.prototype.trim I'd love to be able to write, for example, String#trim (it's not a proposal, just an example of how it might look). As most native ES methods are generic, there are a lot of valid use cases for that, e.g.: Array#forEach.call(listThatsNotArray, fn);

2. Syntax sugar for calling a method as a function. In the following examples I just place '@' at the end of the method that I'd like to run as a function: Array#forEach@(listThatsNotArray, fn); or trimmedListOfStrings = listOfStrings.map(String#trim@); The last example is the same as the following in ES5: trimmedListOfStrings = listOfStrings.map(Function.prototype.call.bind(String.prototype.trim));

These two proposals would make methods easily accessible for some functional constructs, and I think they might be revolutionary for those who favor such a functional style of programming. Let me know what you think.

-- Mariusz Nowak https://github.com/medikoo http://twitter.com/medikoo

___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
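The effect of the proposed `@` sugar can be approximated in ES5 today with a small helper; `demethodize` is a made-up name used here for illustration, not part of any standard or proposal:

```javascript
// Sketch: turn a prototype method into a plain function that takes the
// receiver as its first argument. This is exactly the
// Function.prototype.call.bind(...) pattern from the post, wrapped up.
function demethodize(method) {
  return Function.prototype.call.bind(method);
}

var trim = demethodize(String.prototype.trim);
var trimmed = ['  a  ', ' b '].map(trim);
// trimmed is ['a', 'b']
```

The proposed `String#trim@` syntax would desugar to essentially this, without the helper-function ceremony.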
Re: Sugar for *.prototype and for calling methods as functions
This request is the very definition of little things that go a long way. I write a hell of a lot of code that boils down to Function.prototype.call/apply.bind(Somebuiltin.prototype.method). The fact that there's no builtin way to accomplish `string.split('\n').map(String.split)` (split just as an example) is annoying in how obvious it is that it should work, and how often I need it. In fact I think there's some modification to SpiderMonkey that has this? Array.* and String.* being functional versions of the prototype methods.
Re: Sugar for *.prototype and for calling methods as functions
Error in the example; it should be: `string.split('\n').map(String.trim)`
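With the correction applied, the wished-for pattern can be spelled out longhand in ES5 today; this sketch just demonstrates the verbosity being complained about (the sample `text` value is invented):

```javascript
// The functional trim the thread wants as String.trim, built by hand:
var trim = Function.prototype.call.bind(String.prototype.trim);

var text = '  foo \n bar  \n baz ';
var lines = text.split('\n').map(trim);
// lines is ['foo', 'bar', 'baz']
```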
Re: Sugar for *.prototype and for calling methods as functions
The established way of doing this is [].forEach, ''.trim, {}.valueOf. I imagine that by now there is no performance penalty any more, because most engines are aware of this (ab)use. But it is indeed not very intention-revealing. It might make sense to wait with this proposal until classes are finished, but they probably won't introduce any changes at this level.

On Feb 21, 2012, at 12:16 , Mariusz Nowak wrote: [...]

-- Dr. Axel Rauschmayer a...@rauschma.de home: rauschma.de twitter: twitter.com/rauschma blog: 2ality.com
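The `[].forEach` idiom mentioned here borrows a method via a throwaway literal and applies it to an array-like value. A minimal sketch (the `arrayLike` object is invented for illustration; in browser code it would typically be a NodeList or `arguments`):

```javascript
// Borrow Array.prototype.forEach via an empty array literal and apply it
// to a plain array-like object that has no forEach of its own.
var arrayLike = { 0: 'a', 1: 'b', 2: 'c', length: 3 };
var out = [];
[].forEach.call(arrayLike, function (item) {
  out.push(item);
});
// out is ['a', 'b', 'c']
```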
Re: Sugar for *.prototype and for calling methods as functions
I would ask as an exploratory idea: is there any interest in, and what problems exist with, exposing most {Builtin}.prototype.* methods as unbound functional {Builtin}.* functions? Or, failing that, a more succinct expression for the following: Function.prototype.[call/apply].bind({function}); Array.prototype.[map/reduce/forEach].call(arraylike, callback); Object.set('key', va). Basically, JavaScript has incredible usage potential as a functional language but almost no built-in support in terms of apply-able functions. It teases you with its charms and then gives no direct payout, instead asking you to put just one more dollar in for the good stuff.
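Exposing `{Builtin}.prototype.*` as unbound `{Builtin}.*` functions could be done mechanically; a sketch under the assumption that a fixed list of method names is wanted (the `statics` object stands in for additions to the builtins themselves, and the method list is illustrative):

```javascript
// Build functional, unbound versions of Array prototype methods, so that
// statics.map(arrayLike, fn) behaves like Array.prototype.map.call(arrayLike, fn).
var statics = {};
['map', 'forEach', 'reduce', 'slice'].forEach(function (name) {
  statics[name] = Function.prototype.call.bind(Array.prototype[name]);
});

var doubled = statics.map({ 0: 1, 1: 2, length: 2 }, function (n) {
  return n * 2;
});
// doubled is [2, 4]
```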
Re: Sugar for *.prototype and for calling methods as functions
On 21 February 2012 13:59, Brandon Benvie bran...@brandonbenvie.com wrote: [...]

There is a proposal for making available existing functions via modules in ES6: http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard If there are methods missing from this list that can reasonably be used as stand-alone functions, then I'm sure nobody will object to adding them. /Andreas
Re: New full Unicode for ES6 idea
On 02/20/12 16:47, Brendan Eich wrote: Andrew Oakley wrote: Issues only arise in code that tries to treat a string as an array of 16-bit integers, and I don't think we should be particularly bothered by performance of code which misuses strings in this fashion (but clearly this should still work without opt-in to new string handling).

This is all strings in JS and the DOM, today. That is, we do not have any measure of code that treats strings as uint16s, forges strings using \u, etc., but the ES and DOM specs have allowed this for 14 years. Based on bitter experience, it's likely that if we change by fiat to 21-bit code points from 16-bit code units, some code on the Web will break.

Sorry, I don't think I was particularly clear. The point I was trying to make is that we can *pretend* that code points are 16-bit but actually use a 21-bit representation internally. If content requests proper Unicode support we simply switch to allowing 21-bit code points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).

And as noted in the o.p. and in the thread based on Allen's proposal last year, browser implementations definitely count on representation via array of 16-bit integers, with a length property or method counting the same. Breaking the Web is off the table. Breaking implementations, less so. I'm not sure why you bring up UTF-8. It's good for encoding and decoding, but for JS, unlike C, we want string to be a high-level full-Unicode abstraction, not bytes with bits optionally set indicating more bytes follow to spell code points.

Yes, I probably shouldn't have brought up UTF-8 (we do store strings using UTF-8; I was thinking about our own implementation). The intention was not to break the web; my comments about issues when strings were misused were purely *performance* concerns. Behaviour would otherwise remain unchanged (unless full Unicode support had been enabled).
-- Andrew Oakley
Re: New full Unicode for ES6 idea
On 21 February 2012 00:03, Brendan Eich bren...@mozilla.com wrote: These are byte-based encodings, no? What is the problem inflating them by zero extension to 16 bits now (or 21 bits in the future)? You can't make an invalid Unicode character from a byte value.

One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm-driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

You're right about Big5 being byte-oriented; maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low, making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.

Anyway, Big5 punned into JS strings (via a C or C++ API?) is *not* a strong use-case for ignoring invalid characters. Agreed - I'm stretching to see if I can stretch far enough to find a real problem with BRS -- because I really want it. But the data does not need to arrive from a C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).

Ball one. :-P If I hit the batter, does he get to first base? We still haven't talked about equality and normalization; I suppose that can wait.

Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102
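The "punning" Wes describes, storing the byte stream 4a 4b d8 00 c1 c2 4c as the code units 004a 004b d800 c1c2 004c, can be sketched as follows. This is a simplification for illustration only: it treats any byte at or above 0x80 as a lead byte consuming the next byte, whereas real Big5 has stricter lead- and trail-byte ranges:

```javascript
// Pack a Big5-style byte stream into a JS string: one code unit per
// ASCII byte, one code unit per lead-byte/trail-byte pair. Note that the
// d8 00 pair becomes the lone surrogate code unit 0xD800, which JS
// strings happily store today.
function punBytes(bytes) {
  var units = [];
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] < 0x80) {
      units.push(bytes[i]);
    } else {
      units.push((bytes[i] << 8) | bytes[i + 1]);
      i++; // consume the trail byte
    }
  }
  return String.fromCharCode.apply(null, units);
}

var s = punBytes([0x4a, 0x4b, 0xd8, 0x00, 0xc1, 0xc2, 0x4c]);
// s.length === 5; s.charCodeAt(2) === 0xd800; s.charCodeAt(3) === 0xc1c2
```

Under a BRS-on, error-checking semantics, the lone 0xD800 unit is exactly the kind of value that would become problematic.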
Re: Sugar for *.prototype and for calling methods as functions
There is a proposal for making available existing functions via modules in ES6: http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard If there are methods missing from this list that can reasonably be used as stand-alone functions, then I'm sure nobody will object to adding them. Beautiful, no more constructors-as-poor-man’s-namespaces. All generic methods could be in such modules, with uncurried `this`. But, with generic methods I’m undecided, they also make sense as “static” methods: http://wiki.ecmascript.org/doku.php?id=strawman:array_statics -- Dr. Axel Rauschmayer a...@rauschma.de home: rauschma.de twitter: twitter.com/rauschma blog: 2ality.com
Re: New full Unicode for ES6 idea
Andrew Oakley wrote: [...] Sorry, I don't think I was particularly clear. The point I was trying to make is that we can *pretend* that code points are 16-bit but actually use a 21-bit representation internally.

So far, that's like Allen's proposal from last year (http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings). But you didn't say how iteration (indexing and .length) work.

If content requests proper Unicode support we simply switch to allowing 21-bit code-points and stop encoding characters outside the BMP using surrogate pairs (because the characters now fit in a single code point).

How does content request proper Unicode support? Whatever that gesture is, it's big and red ;-). But we don't have such a switch or button to press like that, yet. If a .js or .html file as fetched from a server has a UTF-8 encoding, indeed non-BMP characters in string literals will be transcoded in open-source browsers and JS engines that use uint16 vectors internally, but each part of the surrogate pair will take up one element in the uint16 vector. Let's take this now as a content request to use full Unicode. But the .js file was developed 8 years ago and assumes two code units, not one.
It hardcodes for that assumption, somehow (indexing, .length exact value, indexOf('\ud800'), etc.). It is now broken. And non-literal non-BMP characters won't be helped by transcoding differently when the .js or .html file is fetched. They'll just change size at runtime. /be
Re: New full Unicode for ES6 idea
Brendan Eich wrote: in open-source browsers and JS engines that use uint16 vectors internally

Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:

var c = "😸"; // where the single character between the quotes is the Unicode character U+1F638
c.length == 2;
c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

Still no BRS to set; we need one if we want a full-Unicode outcome (c.length == 1, etc.). /be
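Allen's example can be checked in any current (pre-BRS) engine; here the astral character is written as an explicit surrogate pair so the snippet survives any transport encoding:

```javascript
// Today's (pre-BRS) semantics for U+1F638 (grinning cat face with
// smiling eyes): one user-perceived character, two UTF-16 code units.
var c = '\uD83D\uDE38';
// c.length === 2         (two code units, not one character)
// c.charCodeAt(0) === 0xD83D  (lead surrogate)
// c.charCodeAt(1) === 0xDE38  (trail surrogate)
// Under the BRS-on outcome Brendan describes, c.length would be 1.
```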
RE: New full Unicode for ES6 idea
Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-)

I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that normalization happens to source upstream of the JS engine, unless by upstream you mean "best see to the normalization yourself." By contrast, providing a method for normalizing strings would be useful. Addison
Re: New full Unicode for ES6 idea
Phillips, Addison wrote: Normalization happens to source upstream of the JS engine. Here I'll call on a designated Unicode hitter. ;-) I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that normalization happens to source upstream of the JS engine, unless by upstream you mean best see to the normalization yourself. Yes ;-). I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was-loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven. By contrast, providing a method for normalizing strings would be useful. /summon Norbert. /be
RE: New full Unicode for ES6 idea
I meant ECMA-262 punts source normalization upstream in the spec pipeline that runs parallel to the browser's loading-the-URL | processing-what-was-loaded pipeline. ECMA-262 is concerned only with its little slice of processing heaven.

Yep. One of the problems is that the source script may not be using a Unicode encoding, or may be using a Unicode encoding and be serialized in a non-normalized form. Your slice of processing heaven treats Unicode-normalization-equivalent-yet-different-codepoint-sequence tokens as unequal. Not that this is a bad thing.

By contrast, providing a method for normalizing strings would be useful. /summon Norbert.

(hides the breakables, listens for thunder) Addison
RE: New full Unicode for ES6 idea
Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.

One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm-driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

AP: GB 18030 is more complex than that. Not all characters are four-byte, for example. As a multibyte encoding, you might choose to “pun” GB 18030 into a String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 0xd830, but, as noted above, someone might be foolish enough to try it ;-). Scripts that rely on this probably break under BRS.

You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.

Not exactly. The trailing bytes in Big5 start at 0x40, for example.
But it is certainly the case that some multibyte characters in Big5 happen to have the same byte-pair as a surrogate code point (when considered as a pair of bytes) or other non-character in the Unicode BMP, and one might (he says, squinting really hard) want to do as you suggest and record the multibyte sequence as a single code point.

But the data does not need to arrive from C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).

Allowing isolated invalid sequences isn’t actually the problem, if you think about it. Yes, the data is bad and yes you can’t view it cleanly. But you can do whatever you need to on it. The problem is when you intend to store two values that end up as a single character. If I have a string with code points “f235 5e7a e040 d800”, the d800 does no particular harm. The problem is: if I construct a BRS string using that sequence and then concatenate the sequence “dc00 a053 3254” onto it, the resulting string is only *six* characters long, rather than the expected seven, since presumably the d800 dc00 pair turns into U+10000.

Addison
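Addison's concatenation hazard can be demonstrated with today's (pre-BRS) semantics, where the string keeps all seven code units but the surrogate pair already reads back as a single code point:

```javascript
// The two halves of Addison's example, as code-unit sequences.
var head = '\uF235\u5E7A\uE040\uD800'; // ends with a lone lead surrogate
var tail = '\uDC00\uA053\u3254';       // begins with a lone trail surrogate
var s = head + tail;

// Today: 7 code units survive the concatenation.
// Under the BRS-on reading Addison describes, the D800/DC00 pair would
// fuse into one supplementary character, giving length 6 instead of 7.
var fused = s.codePointAt(3); // the pair decodes as the single code point 0x10000
```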
Re: New full Unicode for ES6 idea
Phillips, Addison wrote: Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. [...] A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.

Allen's view of the BRS-enabled semantics would have 16-bit GIGO without exceptions -- you'd be storing 16-bit values, whatever their source (including \u literals spelling invalid characters and unmatched surrogates), in at-least-21-bit elements of strings, and reading them back.

My concern, and my reason for advocating early or late errors on shenanigans, was that people today writing surrogate pairs literally and then taking extra pains in JS or C++ (whatever the host language might be) to process them as single code points and characters would be broken by the BRS-enabled behavior of separating the parts into distinct code points.

But that's pessimistic. It could happen, but OTOH anyone coding surrogate pairs might want them to read back piece-wise when indexing. In that case what Allen proposes, storing each formerly 16-bit code unit, however expressed, in the wider 21-or-more-bits unit, and reading back likewise, would just work.

Sorry if this is all obvious. Mainly I want to throw in my lot with Allen's exception-free literal/constructor approach. The encoding APIs should throw on invalid Unicode, but literals and strings as immutable 16-bit storage buffers should work as today. /be
Re: New full Unicode for ES6 idea
On Feb 21, 2012, at 7:37 AM, Brendan Eich wrote: Brendan Eich wrote: in open-source browsers and JS engines that use uint16 vectors internally Sorry, that reads badly. All I meant is that I can't tell what closed-source engines do, not that they do not comply with ECMA-262 combined with other web standards to have the same observable effect, e.g. Allen's example:

A quick scan of http://code.google.com/p/v8/issues/detail?id=761 suggests that there may be more variability among current browsers than we thought. I haven't tried my original test case in Chrome or IE9, but the discussion in this bug report suggests that their behavior may currently be different from FF.

var c = "😸"; // where the single character between the quotes is the Unicode character U+1F638
c.length == 2;
c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

Still no BRS to set, we need one if we want a full-Unicode outcome (c.length == 1, etc.). /be
Re: New full Unicode for ES6 idea
On Tue, Feb 21, 2012 at 3:11 PM, Brendan Eich bren...@mozilla.com wrote: Hi Mark, thanks for this post. Mark Davis ☕ wrote: UTF-8 represents a code point as 1-6 8-bit code units. ... Lock up your encoders, I am so not a Unicode guru but this is what my reptile coder brain remembers.

Only theoretically. UTF-8 has been locked down to the same range that UTF-16 has (RFC 3629), so the largest real character you'll see is 4 bytes, as that gives you exactly 21 bits of data. ~TJ
RE: New full Unicode for ES6 idea
Hi Mark, thanks for this post. Mark Davis ☕ wrote: UTF-8 represents a code point as 1-6 8-bit code units.

No. 1 to *4*. Five- and six-byte UTF-8 sequences are illegal and invalid.

UTF-16 represents a code point as 1 or 2 16-bit code units.

Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).

Addison

Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture.
Re: New full Unicode for ES6 idea
Thanks, all! That's a relief to know; six bytes always seemed too long, but my reptile coder brain was also reptile-coder-lazy and I never dug into it. /be

Phillips, Addison wrote: [...]
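The corrected byte counts are easy to verify: a supplementary character like U+1F638 takes exactly 4 UTF-8 bytes, the maximum since RFC 3629. This sketch uses Node's `Buffer`; in browsers, `TextEncoder` reports the same counts:

```javascript
// UTF-8 byte counts: 1 byte for ASCII, 4 bytes for a supplementary
// character (the maximum since RFC 3629 restricted UTF-8 to U+10FFFF).
var astral = Buffer.from('\u{1F638}', 'utf8');
// astral.length === 4  (F0 9F 98 B8)
var ascii = Buffer.from('A', 'utf8');
// ascii.length === 1
```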
Re: New full Unicode for ES6 idea
I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.

Full 21-bit Unicode support means all of: * indexing by characters, not uint16 storage units; * counting length as one greater than the last index; and * supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list. First come the essentials: regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.

1) Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics. Look at the contortions one currently has to go through to describe a simple character class that includes supplementary characters: https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript: http://inimino.org/~inimino/blog/javascript_cset Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as жвь (Deseret for one). Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.

2) Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5).
But the principle is also important for new functionality being considered for ES6 and above.

3) It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHttpRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (http://code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.

Only after these essentials come the niceties of String representation and Unicode escapes:

4) 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

5) If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint and String.prototype.codePointAt.

6) I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values.
However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.

I think it would help if other people involved in this discussion also clarified what exactly their requirements are for full Unicode support.

Norbert

On Feb 19, 2012, at 0:33 , Brendan Eich wrote: Once more unto the breach, dear friends! ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say ;-). Clearly that was a while ago. These days, we would like full 21-bit Unicode character support in JS. Some (mranney at Voxer) contend that it is a requirement. Full 21-bit Unicode support means all of: * indexing by characters, not uint16 storage units; * counting length as one greater than the last index; and * supporting escapes with (up
Re: New full Unicode for ES6 idea
On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg ecmascr...@norbertlindenberg.com wrote: I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS. Full 21-bit Unicode support means all of: * indexing by characters, not uint16 storage units; * counting length as one greater than the last index; and * supporting escapes with (up to) six hexadecimal digits. For me, full 21-bit Unicode support has a different priority list. First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported. 1) Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two bullets.

2) Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics.

Ditto.

3) It must be clear that the full Unicode character set is allowed and supported.

Absolutely.

Only after these essentials come the niceties of String representation and Unicode escapes: 4) 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

Right!

5) If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.

6) I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values.
However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini-strawman of its own, which Allen will write up for consideration at the next meeting. We will also discuss the BRS. Did you have some thoughts on it?

I think it would help if other people involved in this discussion also clarified what exactly their requirements are for full Unicode support.

Again, apologies for not being explicit. I model the string methods as self-hosted using indexing and .length in straightforward ways. HTH, /be

Norbert

On Feb 19, 2012, at 0:33 , Brendan Eich wrote: Once more unto the breach, dear friends! ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say ;-). Clearly that was a while ago. These days, we would like full 21-bit Unicode character support in JS. Some (mranney at Voxer) contend that it is a requirement. Full 21-bit Unicode support means all of: * indexing by characters, not uint16 storage units; * counting length as one greater than the last index; and * supporting escapes with (up to) six hexadecimal digits. ES4 saw bold proposals including Lars Hansen's, to allow implementations to change string indexing and length incompatibly, and let Darwin sort it out. I recall that was when we agreed to support \u{XX} as an extension for spelling non-BMP characters.
>> Allen's strawman from last year, http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings, proposed a brute-force change to support full Unicode (albeit with too many hex digits allowed in \u{...}), observing that "There are very few places where the ECMAScript specification has actual dependencies upon the size of individual characters, so the compatibility impact of supporting full Unicode is quite small."
>>
>> But two problems remained:
>>
>> P1. As Allen wrote, "There is a larger impact on actual implementations," and no implementors that I can recall were satisfied that the cost was acceptable. It might be, we just didn't know, and there are enough signs of high cost to create this concern.
>>
>> P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine to compute some result. Such usage would break.
>>
>> Example from Allen:
>>
>> var c = "😸"; // where the single character between the quotes is the Unicode character U+1F638
>> c.length == 2;
>> c === "\ud83d\ude38"; //the
Re: New full Unicode for ES6 idea
Second part: the BRS.

I'm wondering how development and deployment of existing full-Unicode software will play out in the presence of a Big Red Switch. Maybe I'm blind and there are ways to simplify the process, but this is how I imagine it.

Let's start with a bit of code that currently supports full Unicode by hacking around ECMAScript's limitations:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

To support applications running in a BRS-on environment, Roozbeh would have to create a parallel version of the module that (a) takes advantage of regular expressions that finally support supplementary characters and (b) uses the new Unicode escape syntax instead of the old one. The parallel version has to be completely separate because a BRS-on environment would reject the old Unicode escapes and an ES5/BRS-off environment would reject the new Unicode escapes.

To get the code tested, he also has to create a parallel version of the test cases. The parallel version would be functionally identical, but set up a BRS-on environment and use the new Unicode escape syntax instead of the old one. It, too, has to be completely separate, for the same reason: each environment rejects the other's escapes. Fortunately the test cases are simple.

Then he has to figure out how the two separate versions of the module will get loaded by clients. It's a YUI module, and the YUI loader already has the ability to look at several parameters to figure out what to load (minimized vs. debug version, localized resource bundles, etc.), so maybe the BRS should be another parameter? But the YUI team has a long to-do list, so in the meantime the module gets two separate names, and the client has to figure out which one to request.

The first client picking up the new version is another, bigger library.
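For a sense of what "hacking around ECMAScript's limitations" means in regular expressions: ES5-era patterns must spell a supplementary character as an explicit surrogate pair, while the later /u flag (ES2015, not yet available at the time of this thread) matches it as a single code point. A sketch, not code from Roozbeh's module:

```javascript
var cat = "\uD83D\uDE38"; // U+1F638, as a surrogate pair

// ES5 era: the pattern must spell out both surrogate halves by hand
var es5Pattern = /\uD83D\uDE38/;

// ES2015 and later: match by code point with the u flag
var uPattern = /\u{1F638}/u;

console.log(es5Pattern.test(cat)); // true
console.log(uPattern.test(cat));   // true
console.log(/^.$/.test(cat));      // false -- without /u, "." sees two uint16 units
console.log(/^.$/u.test(cat));     // true  -- with /u, "." sees one character
```

The last two lines show why point 1 of the priority list (atomic supplementary characters in regular expressions) matters even for patterns that never mention a specific character.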
As a library it doesn't control the BRS, so it has to be able to run with both BRS-on and BRS-off. So it has to check the BRS and load the appropriate version of the intl-bidi module at runtime. This means it also has to be tested in both environments. Its test cases are not simple, so now it needs modifications to the test framework to run the test suite twice, once with BRS-on and once with BRS-off.

An application using the library, and thus the intl-bidi module, decides to take the plunge and switch to BRS-on. It doesn't do text processing itself (that's what libraries are for), and it doesn't use Unicode escapes, so no code changes. But when it throws the switch, exceptions get thrown. It turns out that 3 of the 50 JavaScript files loaded during startup use old Unicode escapes. One of them seems to do something that might affect supplementary characters; for the other two, apparently the developers just felt safer escaping all non-ASCII characters. The developers of the application don't actually know anything about the scripts - they got loaded indirectly by apps, ads, and analytics software used by the application. The developers try to find out whom they'll have to educate about the BRS to get this resolved.

OK - migrations are hard. But so far most participants have only seen additional work, no benefits. How long will this take? When will it end? When will browsers make BRS-on the default, let alone eliminate the switch? When can Roozbeh abandon his original version? Where's the blue button?

The thing to keep in mind is that most code doesn't need to know anything about supplementary characters. The beneficiaries of the switch are only the implementors of functions that do need to know, and even they won't really benefit until the switch is permanently on (at least for all their clients). It seems the switch puts a new burden on many that so far have been rightfully oblivious to supplementary characters.
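The runtime check the library would need could only be a probe of string semantics, roughly like the hypothetical sketch below. The module names are invented for illustration, and since the BRS never shipped, the probe always reports "off" in real engines:

```javascript
// Hypothetical BRS probe: under code-point indexing a supplementary
// character would have length 1; under uint16 indexing it has length 2.
var brsOn = "\uD834\uDD1E".length === 1;

// The library would then request the matching build of the module
// (both names are made up for this sketch).
var moduleName = brsOn ? "gallery-intl-bidi-brs" : "gallery-intl-bidi";

console.log(brsOn, moduleName); // false "gallery-intl-bidi" in shipped engines
```

Which illustrates Norbert's point: every client that cares ends up carrying this fork in its loading logic and testing both branches.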
Norbert

On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

[snip]

> Allen's strawman from last year, http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings, proposed a brute-force change to support full Unicode (albeit with too many hex digits allowed in \u{...}), observing that "There are very few places where the ECMAScript specification has actual dependencies upon the size of individual characters, so the compatibility impact of supporting full Unicode is quite small."
>
> But two problems remained:
>
> P1. As Allen wrote, "There is a larger impact on actual implementations," and no implementors that I can recall were satisfied that the cost was acceptable. It might be, we just didn't know, and there are enough signs of high cost to create this concern.
>
> P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine to compute some result.