Re: Full Unicode strings strawman

2011-05-17 Thread Norbert Lindenberg
I have read the discussion so far, but would like to come back to the  
strawman itself because I believe that it starts with a problem  
statement that's incorrect and misleading the discussion. Correctly  
describing the current situation would help in the discussion of  
possible changes, in particular their compatibility impact.



The relevant portion of the problem statement:

ECMAScript currently only directly supports the 16-bit basic  
multilingual plane (BMP) subset of Unicode which is all that existed  
when ECMAScript was first designed. [...] As currently defined,  
characters in this expanded character set cannot be used in the source  
code of ECMAScript programs and cannot be directly included in runtime  
ECMAScript string values.



My reading of the ECMAScript Language Specification, edition 5.1  
(January 2011), is:


1) ECMAScript allows, but does not require, implementations to support  
the full Unicode character set.


2) ECMAScript allows source code of ECMAScript programs to contain  
characters from the full Unicode character set.


3) ECMAScript requires implementations to treat String values as  
sequences of UTF-16 code units, and defines key functionality based on  
an interpretation of String values as sequences of UTF-16 code units,  
not based on an interpretation as sequences of Unicode code points.


4) ECMAScript prohibits implementations from conforming to the Unicode  
standard with regards to case conversions.



The relevant text portions leading to these statements are:

1) Section 2, Conformance: A conforming implementation of this  
Standard shall interpret characters in conformance with the Unicode  
Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2  
or UTF-16 as the adopted encoding form, implementation level 3. If the  
adopted ISO/IEC 10646-1 subset is not otherwise specified, it is  
presumed to be the BMP subset, collection 300. If the adopted encoding  
form is not otherwise specified, it presumed to be the UTF-16 encoding  
form.


To interpret this, note that the Unicode Standard, Version 3.1 was the  
first one to encode actual supplementary characters [1], and that the  
only difference between UCS-2 and UTF-16 is that UTF-16 supports  
supplementary characters while UCS-2 does not [2].


2) Section 6, Source Text: ECMAScript source text is represented as a  
sequence of characters in the Unicode character encoding, version 3.0  
or later. [...] ECMAScript source text is assumed to be a sequence of  
16-bit code units for the purposes of this specification. [...] If an  
actual source text is encoded in a form other than 16-bit code units  
it must be processed as if it was first converted to UTF-16.


To interpret this, note again that the Unicode Standard, Version 3.1  
was the first one to encode actual supplementary characters, and that  
the conversion requirement enables the use of supplementary characters  
represented as 4-byte UTF-8 characters in source text. As UTF-8 is now  
the most commonly used character encoding on the web [3], the 4-byte  
UTF-8 representation, not Unicode escape sequences, should be seen as  
the normal representation of supplementary characters in ECMAScript  
source text.


3) Section 6, Source Text: If an actual source text is encoded in a  
form other than 16-bit code units it must be processed as if it was  
first converted to UTF-16. [...] Throughout the rest of this document,  
the phrase “code unit” and the word “character” will be used to  
refer to a 16-bit unsigned value used to represent a single 16-bit  
unit of text. Section 15.5.4.4, String.prototype.charCodeAt(pos):  
Returns a Number (a nonnegative integer less than 2**16) representing  
the code unit value of the character at position pos in the String  
resulting from converting this object to a String. Section 15.5.5.1  
length: The number of characters in the String value represented by  
this String object.


I don't like that the specification redefines a commonly used term  
such as character to mean something quite different (code unit),  
and hides that redefinition in a section on source text while applying  
it primarily to runtime behavior. But there it is: Thanks to the  
redefinition, it's clear that charCodeAt() returns UTF-16 code units,  
and that the length property holds the number of UTF-16 code units in  
the string.
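
As a small illustration of point 3 (a sketch, not text from the specification): the string below holds one supplementary character, U+1D49C, written with escape sequences so the example does not depend on the file encoding. The variable names are only for illustration.

```javascript
// One supplementary character, U+1D49C (MATHEMATICAL SCRIPT CAPITAL A),
// written as its UTF-16 surrogate pair.
var s = "\uD835\uDC9C";

// length counts 16-bit code units, not code points.
var unitCount = s.length;

// charCodeAt returns each code unit separately:
// a high surrogate followed by a low surrogate.
var first = s.charCodeAt(0);
var second = s.charCodeAt(1);
```

Here unitCount is 2 even though the string holds a single Unicode character, and neither value returned by charCodeAt is the code point 0x1D49C.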


4) Section 15.5.4.16, String.prototype.toLowerCase(): For the  
purposes of this operation, the 16-bit code units of the Strings are  
treated as code points in the Unicode Basic Multilingual Plane.  
Surrogate code points are directly transferred from S to L without any  
mapping.


This does not meet Conformance Requirement C8 of the Unicode Standard,  
Version 6.0 [4]: When a process interprets a code unit sequence which  
purports to be in a Unicode character encoding form, it shall  
interpret that code unit sequence according to the corresponding code  
point sequence.



References:

[1] 

RE: Use cases for WeakMap

2011-05-17 Thread Hudson, Rick

This is all a bit off topic but performance does matter and folks seem to be 
underestimating the wealth of community knowledge that exists in this area.

Who underestimates?

Sorry, this wasn't meant to slight anyone.  I have spent a career standing on 
the shoulders of Allen and his colleagues. My respect should not be 
underestimated.

Interesting pointer.


-Rick

From: Brendan Eich [mailto:bren...@mozilla.com]
Sent: Monday, May 16, 2011 6:44 PM
To: Hudson, Rick
Cc: Allen Wirfs-Brock; Oliver Hunt; Andreas Gal; es-discuss
Subject: Re: Use cases for WeakMap

On May 16, 2011, at 2:46 PM, Hudson, Rick wrote:


This is all a bit off topic but performance does matter and folks seem to be 
underestimating the wealth of community knowledge that exists in this area.

Who underestimates?

A bunch of us are aware of all this. Allen certainly knows all about it, and 
we've been talking shop with him for years, long before he joined Mozilla :-P. 
I recall a conversation like this one about sparse hashcode implementation with 
Allen, Lars Thomas Hansen (then of Opera), and Graydon Hoare from four or five 
years ago...

http://wiki.ecmascript.org/doku.php?id=proposals:hashcodes (check the history)

However, in this thread, the issue is not optimizing hashcode or other metadata 
sparsely associated with objects. That's a good thing, implementations should 
do it. Having the hashcode in the object wins, compared to having it 
(initially) in a side table, but who's counting?

The issue under dispute was neither sparse hashcode nor sparse fish property 
association, where the property would be accessed by JS user code that 
referenced the containing object itself. Rather, it was whether a frozen object 
needed any hidden mutable state to be a key in a WeakMap. And since this state 
would be manipulated by the GC, it matters if it's in the object, since the GC 
would be touching more potentially randomly distributed memory, thrashing more 
cache.

So far as I can tell, there's no demonstrated need for this hidden-mutable 
key-in-weakmap object state. And it does seem that touching key objects 
unnecessarily will hurt weakmap-aware GC performance. But I may be 
underestimating... :-/

/be

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/16/11 4:38 PM, Wes Garland wrote:

 Two great things about strings composed of Unicode code points:

 ...

  Even though this is a breaking change from ES-5, I support it
 whole-heartedly but I expect breakage to be very limited. Provided
 that the implementation does not restrict the storage of reserved code
 points (D800-DF00)


 Those aren't code points at all.  They're just not Unicode.


Not quite: code points D800-DFFF are reserved code points which are not
representable with UTF-16. Definition D71, Unicode 6.0.


 If you allow storage of such, then you're allowing mixing Unicode strings
 and something else (whatever the something else is), with most likely
 bad results.


I don't believe this is true. We are merely allowing storage of Unicode
strings which cannot be converted into UTF-16.   That allows us to maintain
most of the existing String behaviour  (arbitrary array of uint16), although
overflowing like this would break:

a = String.fromCharCode(str.charCodeAt(0) + 1)

when str[0] is 0+.
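
To make the overflow concrete, here is a sketch that assumes the case in question is a first code unit of 0xFFFF, the largest 16-bit value (that value is my assumption for illustration). Under ES5, String.fromCharCode converts each argument with ToUint16, so the increment wraps around to zero:

```javascript
// Assumption: the first code unit of str is 0xFFFF, the largest 16-bit value.
var str = "\uFFFF";

// ES5 String.fromCharCode applies ToUint16, so 0x10000 wraps to 0x0000.
var a = String.fromCharCode(str.charCodeAt(0) + 1);

var wrapped = (a === "\u0000");
```

With strings of code points the same expression would presumably no longer wrap, which is the behaviour change I mean.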


 Most simply, assigning a DOMString containing surrogates to a JS string
 should collapse the surrogate pairs into the corresponding codepoint if JS
 strings really contain codepoints...

 The only way to make this work is if either DOMString is redefined or
 DOMString and full Unicode strings are different kinds of objects.


  Users doing surrogate pair decomposition will probably find that their
 code just works


 How, exactly?


/** Untested and not rigorous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++)  {
    // Skip high surrogates (0xD800-0xDBFF) so each pair is counted once.
    if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
        validUnicodeString.charCodeAt(i) <= 0xdbff)
      continue;
    length++;
  }
  return length;
}

Code like this, which looks for surrogate pairs in valid Unicode strings,
will simply not find them, instead only finding code points which seem to
be the same size as the code unit.



  Users creating Strings with surrogate pairs will need to
 re-tool


 Such users would include the DOM, right?


I am hopeful that most web browsers have one or few interfaces between DOM
strings and JS strings.  I do not know if my hopes reflect reality.


  but this is a small burden and these users will be at the upper
 strata of Unicode-foodom.


 You're talking every single web developer here.  Or at least every single
 web developer who wants to work with Devanagari text.


I don't think so.  I bet if we could survey web developers across the
industry (rather than just top-tier people who tend to participate in
discussions like this one), we would find that the vast majority of them never
bother handling non-BMP cases, and do not test non-BMP cases.

Heck, I don't even know if a non-BMP character can be data-entered into an
<input type=text maxlength=1> or not. (Do you? What happens?)


  I suspect that 99.99% of users will find that
 this change will fix bugs in their code when dealing with non-BMP
 characters.


 Not unless DOMString is changed or the interaction between the two very
 carefully defined in failure-proof ways.


Yes, I was dismayed to find out that DOMString defines UTF-16.

We could get away with converting UTF-16 at DOMString <-> JSString transition
point.  This might mean that it is possible that JSString => DOMString would
throw, as full Unicode Strings could contain code points which are not
representable in UTF-16.

If it doesn't throw on invalid-in-UTF-16 code points, then round-tripping is
lossy. If it does, that's silly.


 It needed to specify _something_, and UTF-16 was the thing that was
 compatible with how scripts work in ES.  Not to mention the Java legacy of
 the DOM...


By this comment, I am inferring then that DOM and JS Strings share their
backing store.  From an API-cleanliness point of view, that's too bad. From
an implementation POV, it makes sense.  Actually, it makes even more sense
when I recall the discussion we had last week when you explained how
external strings etc work in SpiderMonkey/Gecko.

Do all the browsers share JS/DOM String backing stores?

 It is an unfortunate accident of history that UTF-16 surrogate pairs leak
 their
 abstraction into ES Strings, and I believe it is high time we fixed that.


If you can do that without breaking web pages, great.  If not, then we need
 to talk.  ;)


There is no question in my mind that this proposal would break Unicode-aware
JS.  It is my belief that that doesn't matter if it accompanies other major,
opt-in changes.

Resolving DOM String <-> JS String interchange is a little trickier, but I
think it can be managed if we can allow JS => DOM to throw when high surrogate
code points are encountered in the JS String.  It might mean extra copying,
or it might not if the DOM implementation already uses UTF-8 internally.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Re: arrow syntax unnecessary and the idea that function is too long

2011-05-17 Thread Brendan Eich
On May 16, 2011, at 7:55 PM, Peter Michaux wrote:

 On Mon, May 9, 2011 at 6:02 PM, Brendan Eich bren...@mozilla.com wrote:
 
 Yes, and we could add call/cc to make (some) compiler writers even happier. 
 But users would shoot off all their toes with this footgun, and some 
 implementors would be hard-pressed to support it. The point is *not* to do 
 any one change that maximizes benefits to some parties while harming others.
 
 By the nature of their task and its complexity, compiler writers
 targeting JavaScript need JavaScript to have features that make it
 possible to generate efficient compiled code. Without the big features
 like call/cc there are many things that just cannot be compiled well
 enough...which ultimately means all the languages that compile to
 JavaScript are just thin sugar layers that really aren't even worth
 the bother. Those languages, like Coffeescript, are obscure and known
 by only a few people.

Rails 3.1 is obscure and known by only a few people?

Seriously, check your attitude. There are many languages in the world. 
Asserting that Python implemented via Skulpt is not worth the bother is 
insulting to people working on that project and using it. If you don't think 
it's worth the bother, feel free not to bother. Your opinion does not become an 
imperative to add arbitrary compiler-target wishlist items.

Here's a list of languages that compile to JS:

https://github.com/jashkenas/coffee-script/wiki/List-of-languages-that-compile-to-JS

I'm sure it's not complete.


 The goal of pleasing compiler writers should be
 to make it possible to compile existing languages like Perl, Ruby,
 Python and Scheme to JavaScript. These languages are the ones that
 people know and really want to use and target their compilers to
 JavaScript.

This is not a straight-up discussion. You ignore safety, never mind usability. 
Compiler writers want unsafe interfaces to machine-level abstractions. Should 
we expose them? Certainly not, even though not exposing them hurts efforts to 
compile (not transpile, as you note) other languages to JS.

Too bad -- the first order of business is JS as a source language. Being a 
better target for compilers is secondary. It is among the goals, but not 
super-ordinate.

http://wiki.ecmascript.org/doku.php?id=harmony:harmony

Compiler-writers don't seem to be having such a bad time of it, and we can 
proceed on a more concrete requirements proposal basis than taking 
absolute-sounding philosophical stances.

/be



Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 10:40 AM, Wes Garland wrote:

On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu
Those aren't code points at all.  They're just not Unicode.

Not quite: code points D800-DFFF are reserved code points which are not
representable with UTF-16.


Nor with any other Unicode encoding, really.  They don't represent, on 
their own, Unicode characters.



If you allow storage of such, then you're allowing mixing Unicode
strings and something else (whatever the something else is), with
most likely bad results.

I don't believe this is true. We are merely allowing storage of Unicode
strings which cannot be converted into UTF-16.


No, you're allowing storage of some sort of number arrays that don't 
represent Unicode strings at all.



Users doing surrogate pair decomposition will probably find that
their code just works

How, exactly?

/** Untested and not rigorous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++)  {
    // Skip high surrogates (0xD800-0xDBFF) so each pair is counted once.
    if (validUnicodeString.charCodeAt(i) >= 0xd800 &&
        validUnicodeString.charCodeAt(i) <= 0xdbff)
      continue;
    length++;
  }
  return length;
}

Code like this, which looks for surrogate pairs in valid Unicode
strings, will simply not find them, instead only finding code points
which seem to be the same size as the code unit.


Right, so if it's looking for non-BMP characters in the string, say, 
instead of computing the length, it won't find them.  How the heck is 
that "just works"?
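
For concreteness, a sketch of the kind of existing code I mean (the function name is made up for illustration). With today's uint16 strings it returns true for any string holding a well-formed surrogate pair; if strings held code points instead, the same loop would presumably never see a surrogate and would return false for the same text.

```javascript
// Hypothetical helper: does the string contain a non-BMP character,
// i.e. a high surrogate immediately followed by a low surrogate?
function containsNonBMP(str) {
  for (var i = 0; i + 1 < str.length; i++) {
    var hi = str.charCodeAt(i);
    var lo = str.charCodeAt(i + 1);
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF)
      return true;
  }
  return false;
}
```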



Users creating Strings with surrogate pairs will need to
re-tool

Such users would include the DOM, right?

I am hopeful that most web browsers have one or few interfaces between
DOM strings and JS strings.


A number of web browsers have an interface between DOM and JS strings 
that consists of either memcpy or addref the buffer.


  I do not know if my hopes reflect reality.

They probably do, so you're only really talking about at least 10 
different places across at least 5 different codebases that have to be 
fixed, in a coordinated way...




You're talking every single web developer here.  Or at least every
single web developer who wants to work with Devanagari text.

I don't think so.  I bet if we could survey web developers across the
industry (rather than just top-tier people who tend to participate in
discussions like this one), we would find that the vast majority of them
never bother handling non-BMP cases, and do not test non-BMP cases.


And how many of them use libraries that handle that for them?

And how many implicitly rely on DOM-to-JS roundtripping without 
explicitly doing anything with non-BMP chars or surrogate pairs?



Heck, I don't even know if a non-BMP character can be data-entered into
an <input type=text maxlength=1> or not. (Do you? What happens?)


It cannot in Gecko, as I recall; there maxlength is interpreted as 
number of UTF-16 code units.


In WebKit, maxlength is interpreted as the number of grapheme clusters 
based on my look at their code just now.


I don't know offhand about Presto and Trident, for obvious reasons.


We could get away with converting UTF-16 at DOMString <-> JSString
transition point.


What would that even mean?  DOMString is defined to be an ES string in 
the ES binding right now.  Is the proposal to have some other kind of 
object for DOMString (so that, for example, String.prototype would no 
longer affect the behavior of DOMString the way it does now)?



This might mean that it is possible that
JSString => DOMString would throw, as full Unicode Strings could contain
code points which are not representable in UTF-16.


How is that different from sticking non-UTF-16 into an ES string right now?


If it doesn't throw on invalid-in-UTF-16 code points, then round-tripping is
lossy. If it does, that's silly.


So both options suck, yes?  ;)


It needed to specify _something_, and UTF-16 was the thing that was
compatible with how scripts work in ES.  Not to mention the Java
legacy of the DOM...

By this comment, I am inferring then that DOM and JS Strings share their
backing store.


That's not what the comment was about, actually.  The comment was about API.

But yes, in many cases they do share backing store.


Do all the browsers share JS/DOM String backing stores?


Gecko does in some cases.

WebKit+JSC does in all cases, I believe (or at least a large majority of 
cases).


I don't know about others.


There is no question in my mind that this proposal would break
Unicode-aware JS.


As far as I can tell it would also break Unicode-unaware JS.


It is my belief that that doesn't matter if it
accompanies other major, opt-in changes.


If it's opt-in, perhaps.


Resolving DOM String <-> JS String interchange is a little trickier, but
I think it can be managed if we can allow JS => DOM to throw when high
surrogate code points are encountered in the JS String.


I'm 99% sure this would break sites.


It might mean 

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 16, 2011, at 8:13 PM, Allen Wirfs-Brock wrote:

 I think it does. In another reply I also mentioned the possibility of tagging 
 in a JS visible manner strings that have gone through a known encoding 
 process.

Saw that, seems helpful. Want to spec it?


 If the strings you are combining from different sources have not been 
 canonicalized to a common encoding then you better be damn careful how you 
 combine them.

Programmers miss this as you note, so arguably things are not much worse, at 
best no worse, with your proposal.

Your strawman does change the game, though, hence the global or cross-cutting 
(non-modular) concern. I'm warm to it, after digesting. It's about time we get 
past the 90's!


 The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid 
 encoding that Boris and others have pointed out).  I don't know about other 
 sources such as XMLHttpRequest or the file APIs.  However, in the long run JS 
 in the browser is going to have to be able to deal with arbitrary encodings.  
 You can hide such things from many programmers but not all.  After all, 
 people actually have to implement transcoders.

Transcoding to some canonical Unicode representation is often done by the 
browser upstream of JS, and that's a good thing. Declarative specification by 
authors, implementation by relatively few browser i18n gurus, sparing the many JS 
devs the need to worry. This is good, I claim.

That it means JS hackers are careless about Unicode is inevitable, and there 
are other reasons for that condition anyway. At least with your strawman there 
will be full Unicode flowing through JS and back into the DOM and layout.

/be



Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 1:05 PM, Brendan Eich wrote:

If the strings you are combining from different sources have not been 
canonicalized to a common encoding then you better be damn careful how you combine 
them.


Programmers miss this as you note, so arguably things are not much worse, at 
best no worse, with your proposal.


Right now, by the time a string gets into JS in browsers it's been 
canonicalized into UTF-16, to the best of the browser's ability, unless 
you explicitly tell it otherwise (e.g. with the user-defined charset 
hackery on XHR).



The DOM seems to canonicalize to UTF-16 (with some slop WRT invalid 
encoding that Boris and others have pointed out).  I don't know about other sources 
such as XMLHttpRequest or the file APIs.  However, in the long run JS in the 
browser is going to have to be able to deal with arbitrary encodings.  You can 
hide such things from many programmers but not all.  After all, people actually 
have to implement transcoders.


Transcoding to some canonical Unicode representation is often done by the 
browser upstream of JS, and that's a good thing. Declarative specification by 
authors, implementation by relatively few browser i18n gurus, sparing the many JS 
devs the need to worry. This is good, I claim.


Yes.  And right now that's how it works and actual JS authors typically 
don't have to worry about encoding issues.  I don't agree with Allen's 
claim that in the long run JS in the browser is going to have to be 
able to deal with arbitrary encodings.  Having the _capability_ might 
be nice, but forcing all web authors to think about it seems like a 
non-starter.



That it means JS hackers are careless about Unicode is inevitable, and there 
are other reasons for that condition anyway. At least with your strawman there 
will be full Unicode flowing through JS and back into the DOM and layout.


See, this is the part I don't follow.  What do you mean by "full 
Unicode" and how do you envision it flowing?


-Boris


Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:

 Yes.  And right now that's how it works and actual JS authors typically don't 
 have to worry about encoding issues.  I don't agree with Allen's claim that 
 in the long run JS in the browser is going to have to be able to deal with 
 arbitrary encodings.  Having the _capability_ might be nice, but forcing all 
 web authors to think about it seems like a non-starter.

Allen said "be able to", not "forcing". Big difference. I think we three at 
least are in agreement here.


 
 That it means JS hackers are careless about Unicode is inevitable, and there 
 are other reasons for that condition anyway. At least with your strawman 
 there will be full Unicode flowing through JS and back into the DOM and 
 layout.
 
 See, this is the part I don't follow.  What do you mean by full Unicode and 
 how do you envision it flowing?

I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) 
only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, 
ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.

/be


Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:


Yes.  And right now that's how it works and actual JS authors typically don't have to 
worry about encoding issues.  I don't agree with Allen's claim that in the long run 
JS in the browser is going to have to be able to deal with arbitrary encodings.  
Having the _capability_ might be nice, but forcing all web authors to think about it 
seems like a non-starter.


Allen said "be able to", not "forcing". Big difference. I think we three at 
least are in agreement here.


I think we're in agreement on the sentiment, but perhaps not on where on 
the "able to" to "forcing" spectrum this strawman falls.



See, this is the part I don't follow.  What do you mean by full Unicode and 
how do you envision it flowing?


I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) 
only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, 
ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.


That doesn't answer my questions

-Boris


Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:

 On 5/17/11 1:27 PM, Brendan Eich wrote:
 On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:
 
 Yes.  And right now that's how it works and actual JS authors typically 
 don't have to worry about encoding issues.  I don't agree with Allen's 
 claim that in the long run JS in the browser is going to have to be able 
 to deal with arbitrary encodings.  Having the _capability_ might be nice, 
 but forcing all web authors to think about it seems like a non-starter.
 
 Allen said "be able to", not "forcing". Big difference. I think we three at 
 least are in agreement here.
 
 I think we're in agreement on the sentiment, but perhaps not on where on the 
 "able to" to "forcing" spectrum this strawman falls.

Where do you read "forcing"? Not in the words you cited.


 See, this is the part I don't follow.  What do you mean by full Unicode 
 and how do you envision it flowing?
 
 I mean UTF-16 flowing through, but as you say that happens now -- but (I 
 reply) only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits 
 at a time, ignoring surrogates). And JS code does generally assume 16 bits 
 are enough.
 
 With Allen's proposal we'll finally have some new APIs for JS developers to 
 use.
 
 That doesn't answer my questions

Ok, full Unicode means non-BMP characters not being wrongly treated as two 
uint16 units and miscounted, separated or partly deleted by splicing and 
slicing, etc.

IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This is a 
big deal!
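
A small sketch of the miscounting and splitting I mean, as it behaves with today's uint16 strings (the variable names are illustrative only):

```javascript
// "a", then U+10400 as a surrogate pair, then "b": three characters.
var s = "a\uD801\uDC00b";

// Counted as uint16 units, the string has length 4, not 3.
var units = s.length;

// Slicing at a unit boundary can cut the pair in half,
// leaving a lone high surrogate at the end of the result.
var head = s.slice(0, 2);
var endsWithLoneSurrogate = (head.charCodeAt(1) === 0xD801);
```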

Hope this helps,

/be


Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 1:40 PM, Brendan Eich wrote:

On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote:


On 5/17/11 1:27 PM, Brendan Eich wrote:

On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote:


Yes.  And right now that's how it works and actual JS authors typically don't have to 
worry about encoding issues.  I don't agree with Allen's claim that in the long run 
JS in the browser is going to have to be able to deal with arbitrary encodings.  
Having the _capability_ might be nice, but forcing all web authors to think about it 
seems like a non-starter.


Allen said "be able to", not "forcing". Big difference. I think we three at 
least are in agreement here.


I think we're in agreement on the sentiment, but perhaps not on where on the "able to" to 
"forcing" spectrum this strawman falls.


Where do you read "forcing"? Not in the words you cited.


In the substance of having strings in different encodings around at the 
same time.  If that doesn't force developers to worry about encodings, 
what does, exactly?



I mean UTF-16 flowing through, but as you say that happens now -- but (I reply) 
only if JS doesn't mess with things in a UCS-2 way (indexing 16-bits at a time, 
ignoring surrogates). And JS code does generally assume 16 bits are enough.

With Allen's proposal we'll finally have some new APIs for JS developers to use.


That doesn't answer my questions


Ok, full Unicode means non-BMP characters not being wrongly treated as two 
uint16 units and miscounted, separated or partly deleted by splicing and 
slicing, etc.

IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This is a 
big deal!


OK, but still allows sticking non-Unicode gunk into the strings, right? 
 So they're still vectors of something.  Whatever that something is.



Hope this helps,


Halfway.  The DOM interaction questions remain unanswered.  Seriously, I 
think we should try to make a list of the issues there, the pitfalls 
that would arise for web developers as a result, then go through and see 
how and whether to address them.  Then we'll have a good basis for 
considering the web compat impact


-Boris


Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:

 On 5/17/11 1:40 PM, Brendan Eich wrote:
 Where do you read "forcing"? Not in the words you cited.
 
 In the substance of having strings in different encodings around at the same 
 time.  If that doesn't force developers to worry about encodings, what does, 
 exactly?

Where in the strawman is anything of that kind observably (to JS authors) 
proposed?


 Ok, full Unicode means non-BMP characters not being wrongly treated as two 
 uint16 units and miscounted, separated or partly deleted by splicing and 
 slicing, etc.
 
 IOW, JS grows to treat strings as full Unicode, not uint16 vectors. This 
 is a big deal!
 
 OK, but still allows sticking non-Unicode gunk into the strings, right?  So 
 they're still vectors of something.  Whatever that something is.

Yes, old APIs for building strings, e.g. String.fromCharCode, still build gunk 
strings, aka uint16 data hacked into strings. New APIs for characters. This 
has to apply to internal JS engine / DOM implementation APIs as needed, too.
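
To illustrate the difference, here is a sketch only; fromCodePoint below is a hypothetical helper, not something the strawman or ES5 defines. String.fromCharCode will happily build a string holding a lone surrogate, while a character-oriented API would take a code point and, on today's uint16 strings, have to emit a well-formed surrogate pair.

```javascript
// Old API: builds "gunk", here a lone high surrogate.
var gunk = String.fromCharCode(0xD800);

// Hypothetical character-oriented helper for today's uint16 strings.
function fromCodePoint(cp) {
  // Reject values outside Unicode and the surrogate range itself.
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    throw new RangeError("not a Unicode scalar value: " + cp);
  if (cp < 0x10000)
    return String.fromCharCode(cp);
  // Split the 20-bit offset into a high and a low surrogate.
  var offset = cp - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10),
                             0xDC00 + (offset & 0x3FF));
}
```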


 Hope this helps,
 
 Halfway.  The DOM interaction questions remain unanswered.  Seriously, I 
 think we should try to make a list of the issues there, the pitfalls that 
 would arise for web developers as a result, then go through and see how and 
 whether to address them.  Then we'll have a good basis for considering the 
 web compat impact

Good idea.

/be



Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:47 AM, Brendan Eich wrote:

 On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:
 
 On 5/17/11 1:40 PM, Brendan Eich wrote:
 Where do you read forcing? Not in the words you cited.
 
 In the substance of having strings in different encodings around at the same 
 time.  If that doesn't force developers to worry about encodings, what does, 
 exactly?
 
 Where in the strawman is anything of that kind observably (to JS authors) 
 proposed?

The flag idea just mooted in this thread is not addressing a new problem -- we 
can have such mixing bugs today. True, the odds may go up for such bugs in the 
future (hard to assess whether or how much).

At least with new APIs for characters not gunk-units, we can detect mixtures 
dynamically. This still seems a good idea but it is not essential (yet) and it 
is nowhere near forcing developers to worry about encodings.

/be



Re: arrow syntax unnecessary and the idea that function is too long

2011-05-17 Thread Axel Rauschmayer
On May 17, 2011, at 4:57, Peter Michaux petermich...@gmail.com wrote:

 The goal of pleasing compiler writers should be
 to make it possible to compile existing languages like Perl, Ruby,
 Python and Scheme to JavaScript. These languages are the ones that
 people know and really want to use and target their compilers to
 JavaScript.


You sound like you really hate JavaScript and can’t imagine working with it 
unless some other language is compiled to it.

I’ve programmed quite a bit of Perl, Python, and Scheme and found that once you 
get to know the proverbial “good parts” of JavaScript, it can be quite elegant. 
That is, I don’t miss any of these three languages, except maybe for 
Python’s runtime library (and Java’s tools, but that’s a different topic).

With the increasing momentum behind JavaScript, IMHO the primary goal should be 
to improve the language for people who actually want to program in it. This is 
difficult enough, given all the parties that have to be pleased. Listening to 
feedback from compiler writers should be a secondary goal.

-- 
Dr. Axel Rauschmayer

a...@rauschma.de
twitter.com/rauschma

home: rauschma.de
blog: 2ality.com





RE: Full Unicode strings strawman

2011-05-17 Thread Shawn Steele
I would much prefer changing UCS-2 to UTF-16, thus formalizing that 
surrogate pairs are permitted.  That would be very unlikely to break any existing 
code and would still allow representation of everything reasonable in Unicode.

That would enable Unicode, and allow extending string literals and regular 
expressions for convenience with the U+10 style notation (which would be 
equivalent to the surrogate pair).  The character code manipulation functions 
could be similarly augmented without breaking anything (and maybe not needing 
different names?)

You might want to qualify the UTF-16 as allowing, but strongly discouraging, 
lone surrogates for those people who didn't realize their binary data wasn't a 
string.

The sole disadvantage would be that iterating through a string would require 
consideration of surrogates, same as today.  The same caution is also necessary 
to avoid splitting Ä (U+0041 U+0308) into its component A and   ̈ parts.  I 
wouldn't be opposed to some sort of helper functions or classes that aided in 
walking strings, preferably with options to walk the graphemes (or whatever), 
not just the surrogate pairs.  FWIW: we have such a helper for surrogates in 
.Net and nobody uses it.  The most common feedback is that it's not that 
helpful because it doesn't deal with the graphemes.
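
As a rough illustration of the kind of helper meant here, the sketch below walks a string by code point using only charCodeAt. The function name codePoints is hypothetical, and the helper deliberately stops at code points, so it does not solve the grapheme problem.

```javascript
// Hypothetical helper: walk a string by code point rather than by code unit.
// Unpaired surrogates are passed through as their own values.
function codePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    out.push(hi);
  }
  return out;
}

var s = "A\u0308\uD835\uDC9C"; // A + combining diaeresis, then U+1D49C
console.log(s.length);             // 4 code units
console.log(codePoints(s).length); // 3 code points, but only 2 graphemes
```

The A and its combining diaeresis still come back as two separate entries, which is the feedback about graphemes in a nutshell.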

- Shawn

shawn.ste...@microsoft.com
Senior Software Design Engineer
Microsoft Windows



Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 1:47 PM, Brendan Eich wrote:

On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote:


On 5/17/11 1:40 PM, Brendan Eich wrote:

Where do you read forcing? Not in the words you cited.


In the substance of having strings in different encodings around at the same 
time.  If that doesn't force developers to worry about encodings, what does, 
exactly?


Where in the strawman is anything of that kind observably (to JS authors) 
proposed?


The strawman is silent on the matter.

It was proposed by Allen in the discussion about how the strawman 
interacts with the DOM.


-Boris


Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 12:36, Boris Zbarsky bzbar...@mit.edu wrote:

 Not quite: code points D800-DFFF are reserved code points which are not
 representable with UTF-16.


 Nor with any other Unicode encoding, really.  They don't represent, on
 their own, Unicode characters.


Right - but they are still legitimate code points, and they fill out the
space required to let us treat String as uint16[] when defining the backing
store as something that maps to the set of all Unicode code points.

That said, you can encode these code points with utf-8; for example, 0xdc08
becomes 0xed 0xb0 0x88.
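
The arithmetic behind those three bytes can be sketched as follows. The function name is hypothetical, and it applies the generic three-byte bit layout mechanically; it does not reject the surrogate range the way a conforming UTF-8 encoder is required to.

```javascript
// Generic three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx.
// Only meaningful for values in the range 0x0800..0xFFFF.
// No surrogate check is made here; that omission is deliberate.
function threeBytePattern(cp) {
  return [
    0xE0 | (cp >> 12),          // top 4 bits
    0x80 | ((cp >> 6) & 0x3F),  // middle 6 bits
    0x80 | (cp & 0x3F)          // low 6 bits
  ];
}

var bytes = threeBytePattern(0xDC08);
console.log(bytes.map(function (b) { return b.toString(16); }).join(" "));
// ed b0 88
```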

No, you're allowing storage of some sort of number arrays that don't
 represent Unicode strings at all.


No, if I understand Allen's proposal correctly, we're allowing storage of
some sort of number arrays that may contain reserved code points, some of
which cannot be represented in UTF-16.

This isn't that different from the status quo; it is possible right now to
generate JS Strings which are not valid UTF-16 by creating invalid surrogate
pairs.

Keep in mind, also, that even a sequence of random bytes is a valid Unicode
string. The standard does not require that they be well-formed. (D80)


 Right, so if it's looking for non-BMP characters in the string, say,
 instead of computing the length, it won't find them.  How the heck is that
 "just works"?


My untested hypothesis is that the vast majority of JS code looking for
non-BMP characters is looking for them in order to call them out for special
processing, because the code unit and code point size are different.  When
they don't need special processing, they don't need to be found.  Since the
high-surrogate code points do not appear in well-formed Unicode strings,
they will not be found, and the unneeded special processing will not
happen.  This train of clauses forms the basis for my opinion that, for the
majority of folks, things will just work.


 What would that even mean?  DOMString is defined to be an ES string in the
 ES binding right now.  Is the proposal to have some other kind of object for
 DOMString (so that, for example, String.prototype would no longer affect the
 behavior of DOMString the way it does now)?


Wait, are DOMStrings formally UTF-16, or are they ES Strings?



  This might mean that it is possible that
 JSString=DOMString would throw, as full Unicode Strings could contain
 code points which are not representable in UTF-16.


 How is that different from sticking non-UTF-16 into an ES string right now?


Currently, JS Strings are effectively arrays of 16-bit code units, which are
indistinguishable from 16-bit Unicode strings (D82).  This means that a JS
application can use JS Strings as arrays of uint16, and expect to be able to
round-trip all strings, even those which are not well-formed, through a
UTF-16 DOM.

If we redefine JS Strings to be arrays of Unicode code points, then the JS
application can use JS Strings as arrays of uint21 -- but round-tripping the
high-surrogate code points through a UTF-16 layer would not work.
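
For reference, this is roughly what "using JS Strings as arrays of uint16" looks like today. The pack and unpack names are hypothetical; the sketch only shows the in-language round trip and says nothing about what a particular DOM implementation does with the string in between.

```javascript
// Pack arbitrary 16-bit values into a string and read them back.
function pack(units) {
  return String.fromCharCode.apply(null, units);
}
function unpack(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

var data = [0xDC08, 0x0041, 0xD800, 0xFFFF]; // not well-formed UTF-16
var back = unpack(pack(data));
console.log(back.join(",") === data.join(",")); // true
```

Any layer that rejected or replaced the lone surrogates on the way through would make the final comparison fail, which is the compatibility concern.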



  It might mean extra copying, or it might not if the DOM implementation
 already uses UTF-8 internally.


 Uh... what does UTF-8 have to do with this?


If you're already storing UTF-8 strings internally, then you are already
doing something expensive (like copying) to get their code units into and
out of JS; so no incremental perf impact by not having a common UTF-16
backing store.


 (As a note, Gecko and WebKit both use UTF-16 internally; I would be
 _really_ surprised if Trident does not.  No idea about Presto.)


FWIW - the last time I scanned the v8 sources, it appeared to use a
three-representation class, which could store either ASCII, UCS2, or UTF-8.
Presumably ASCII could also be ISO-Latin-1, as both are exact, naive,
byte-sized UCS2/UTF-16 subsets.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 2:12 PM, Wes Garland wrote:

That said, you can encode these code points with utf-8; for example,
0xdc08 becomes 0xed 0xb0 0x88.


By the same argument, you can encode them in UTF-16.  The byte sequence 
above is not valid UTF-8.  See "How do I convert an unpaired UTF-16 
surrogate to UTF-8?" at http://unicode.org/faq/utf_bom.html which says:


  A different issue arises if an unpaired surrogate is encountered
  when converting ill-formed UTF-16 data. By representing such an
  unpaired surrogate on its own as a 3-byte sequence, the resulting
  UTF-8 data stream would become ill-formed. While it faithfully
  reflects the nature of the input, Unicode conformance requires that
  encoding form conversion always results in a valid data stream.
  Therefore a converter must treat this as an error.

(fwiw, this is the third hit on Google for "utf-8 surrogates" right 
after the Wikipedia articles on UTF-8 and UTF-16, so it's not like it's 
hard to find this information).



No, you're allowing storage of some sort of number arrays that don't
represent Unicode strings at all.

No, if I understand Allen's proposal correctly, we're allowing storage
of some sort of number arrays that may contain reserved code points,
some of which cannot be represented in UTF-16.


See above.  You're allowing number arrays that may or may not be 
interpretable as Unicode strings, period.



This isn't that different from the status quo; it is possible right now
to generate JS Strings which are not valid UTF-16 by creating invalid
surrogate pairs.


True.  However right now no one is pretending that strings are anything 
other than arrays of 16-bit units.



Keep in mind, also, that even a sequence of random bytes is a valid
Unicode string. The standard does not require that they be well-formed.
(D80)


Uh...  A sequence of _bytes_ is not anything related to Unicode unless 
you know how it's encoded.


Not sure what (D80) is supposed to mean.


Right, so if it's looking for non-BMP characters in the string, say,
instead of computing the length, it won't find them.  How the heck
is that "just works"?

My untested hypothesis is that the vast majority of JS code looking for
non-BMP characters is looking for them in order to call them out for
special processing, because the code unit and code point size are
different.  When they don't need special processing, they don't need to
be found.


This hypothesis is worth testing before being blindly inflicted on the web.


What would that even mean?  DOMString is defined to be an ES string
in the ES binding right now.  Is the proposal to have some other
kind of object for DOMString (so that, for example, String.prototype
would no longer affect the behavior of DOMString the way it does now)?

Wait, are DOMStrings formally UTF-16, or are they ES Strings?


DOMStrings are formally UTF-16 in the DOM spec.

They are defined to be ES strings in the ES binding for the DOM.

Please be careful to not confuse the DOM and its language bindings.

One could change the ES binding to use a non-ES-string object to 
preserve the DOM's requirement that strings be sequences of UTF-16 code 
units.  I'd expect this would break the web unless one is really careful 
doing it...



How is that different from sticking non-UTF-16 into an ES string
right now?

Currently, JS Strings are effectively arrays of 16-bit code units, which
are indistinguishable from 16-bit Unicode strings


Yes.


(D82)


?


This means that a JS application can use JS Strings as arrays of uint16, and
expect to be able to round-trip all strings, even those which are not
well-formed, through a UTF-16 DOM.


Yep.  And they do.


If we redefine JS Strings to be arrays of Unicode code points, then the
JS application can use JS Strings as arrays of uint21 -- but round-tripping
the high-surrogate code points through a UTF-16 layer would not work.


OK, that seems like a breaking change.


It might mean extra copying, or it might not if the DOM
implementation already uses UTF-8 internally.

Uh... what does UTF-8 have to do with this?

If you're already storing UTF-8 strings internally, then you are already
doing something expensive (like copying) to get their code units into
and out of JS


Maybe, and maybe not.  We (Mozilla) have had some proposals to actually 
use UTF-8 throughout, including in the JS engine; it's quite possible to 
implement an API that looks like a 16-bit array on top of UTF-8 as long 
as you allow invalid UTF-8 that's needed to represent surrogates and the 
like.



(As a note, Gecko and WebKit both use UTF-16 internally; I would be
_really_ surprised if Trident does not.  No idea about Presto.)

FWIW - the last time I scanned the v8 sources, it appeared to use a
three-representation class, which could store either ASCII, UCS2, or
UTF-8.  Presumably ASCII could also be ISO-Latin-1, as both are exact,
naive, byte-sized UCS2/UTF-16 subsets.


There's a 

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 2:24 PM, Allen Wirfs-Brock wrote:

In the substance of having strings in different encodings around at
the same time. If that doesn't force developers to worry about
encodings, what does, exactly?


This already occurs in JS. For example, the encodeURI function produces
a string whose characters are the UTF-8 encoding of a UTF-16 string
(including recognition of surrogate pairs).


Last I checked, encodeURI output a pure ASCII string.  Am I just missing 
something?  The ASCII string happens to be the %-escaping of the UTF-8 
representation of the Unicode string you get by assuming that the 
initial JS string is a UTF-16 representation of said Unicode string. 
But at no point here is the author dealing with UTF-8.
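
A quick check of that description, using only the built-in encodeURI. The sample string is an arbitrary choice for illustration.

```javascript
var s = "\u00E9\uD835\uDC9C"; // U+00E9 followed by U+1D49C as a surrogate pair
var escaped = encodeURI(s);
console.log(escaped); // %C3%A9%F0%9D%92%9C
console.log(/^[\x00-\x7F]*$/.test(escaped)); // true: pure ASCII

// An unpaired surrogate has no UTF-8 form, so encodeURI throws.
var threw = false;
try {
  encodeURI("\uD835");
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true
```

The pair is recognized and escaped as one four-byte UTF-8 sequence, and the output never contains anything but ASCII.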



OK, but still allows sticking non-Unicode gunk into the strings,
right? So they're still vectors of something. Whatever that
something is.


Conceptually unsigned 32-bit values. The actual internal representation
is likely to be something else.


I don't care about the internal representation; I'm interested in the 
author-observable behavior.



Interpretation of those values is left to the functions (both built-in and 
application) that operate upon them.


OK.  That includes user-written functions, of course, which currently 
only have to deal with UTF-16 (and maybe UCS-2 if you want to be very 
pedantic).



Most built-in string methods do not apply any interpretation and will
happily process strings as vectors of arbitrary uint32 values. Some
built-ins (encodeURI/decodeURI, toUpperCase/toLowerCase) explicitly deal
with Unicode characters or various Unicode encodings and these have to
be explicitly defined to deal with non-Unicode character values or
invalid encodes.


That seems fine.  This is not where problems lie.


These functions already are defined for ES5 in this
manner WRT the representation of strings as vectors of arbitrary uint16
values.


Yes, sure.

-Boris


RE: Full Unicode strings strawman

2011-05-17 Thread Phillips, Addison
Note: The W3C Internationalization Core WG published a set of requirements in 
this area for consideration by ES some time ago. It lives here:

   http://www.w3.org/International/wiki/JavaScriptInternationalization 

The section on 'locale related behavior' is being separately addressed.

I think that:

1. Changing references from UCS-2 to UTF-16 makes sense, although the spec, 
IIRC, already *says* UTF-16.
2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is 
ill-formed, but there are too many cases in which one might wish to have such 
broken strings for scripting purposes.
3. We should have escape syntax for supplementary characters (such as 
\U001). Looking up the surrogate pair for a given Unicode character is 
extremely inconvenient and is not self-documenting.

As Shawn notes, basically, there are three ways that one might wish to access 
strings:

- as grapheme clusters (visual units of text)
- as Unicode scalar values (logical units of text, i.e. characters)
- as code units (encoding units of text)

The example I use in the Unicode conference internationalization tutorial is a 
box on a Web site with an ES-controlled message underneath it saying "You have 
200 characters remaining."

I think it is instructive to look at how Java managed this transition. In some 
cases the 200 represents the number of storage units I have available (as in 
my backing database), in which case String.length is what I probably want. In 
some cases I want to know how many Unicode characters there are (Java solves 
this with the codePointCount(), codePointBefore(), and codePointAt() methods). 
These are relatively rare operations, but they have occasional utility. Or I 
may want grapheme clusters (Java attempts to solve this with BreakIterators and 
I tend to favor doing the same thing in JavaScript---default grapheme clusters 
are better than nothing, but language-specific grapheme clusters are more 
useful).
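
For the "200 characters remaining" box, the first two counts can be sketched with what the language already has. The helper names below are hypothetical (the second borrows its name from the Java method mentioned above), and grapheme cluster counting is left out because it needs segmentation rules that are not sketched here.

```javascript
// Count of storage units: what String.length reports today.
function codeUnitCount(str) {
  return str.length;
}
// Each well-formed surrogate pair is one code point but two code units.
function codePointCount(str) {
  var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
  return str.length - (pairs ? pairs.length : 0);
}

var input = "e\u0301\uD835\uDC9C"; // e + combining acute, then U+1D49C
console.log(codeUnitCount(input));  // 4
console.log(codePointCount(input)); // 3
// A reader would see 2 grapheme clusters in this input.
```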

If we follow the above, providing only minimal additional methods for accessing 
codepoints when necessary, this also limits the impact of adding supplementary 
character support to the language. Regex probably works the way one supposes 
(both \U001 and \ud800\udc00 find the surrogate pair \ud800\udc00 and one 
can still find the low surrogate \udc00 if one wishes to). And existing 
scripts will continue to function without alteration. However, new scripts can 
be written that use supplementary characters. 
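
The code unit half of that claim can be checked with today's regular expressions. This sketch uses only existing \uXXXX escapes; the proposed escape for supplementary characters is not shown because its syntax is not settled in this thread.

```javascript
var text = "x\uD800\uDC00y"; // U+10000 stored as a surrogate pair

console.log(/\uD800\uDC00/.test(text)); // true: the pair matches as two units
console.log(/\uDC00/.test(text));       // true: the low surrogate alone is found
console.log(text.search(/\uDC00/));     // 2
console.log(/^x.y$/.test(text));        // false: "." matches one code unit
console.log(/^x..y$/.test(text));       // true
```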

Regards,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.





RE: Full Unicode strings strawman

2011-05-17 Thread Shawn Steele
 Right - but they are still legitimate code points, and they fill out the 
 space required to let us treat String as uint16[] when defining the backing 
 store as something that maps to the set of all Unicode code points.

 That said, you can encode these code points with utf-8; for example, 0xdc08 
 becomes 0xed 0xb0 0x88.
No, you're allowing storage of some sort of number arrays that don't represent 
Unicode strings at all.

Codepoints != encoding.  High and Low surrogates are legal code points, but are 
only legitimate code points in UTF-16 if they occur in a pair.  If they aren’t 
in a proper pair, they’re illegal.  They are always illegal in UTF-32 and UTF-8.  
There are other code points that shouldn’t be used for interchange in Unicode 
too: U+xx/U+xxFFFE for example.  It’s orthogonal to the other question, but 
the documentation should clearly suggest that users don’t pretend binary data 
is character data when it’s not.  That leads to all sorts of crazy stuff, like 
illegal lone surrogates trying to be illegally encoded in UTF-8.
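
The pairing rule for UTF-16 can be written down as a short check. The function name is hypothetical, and the sketch covers only surrogate pairing; it does not look for the other code points that should not be interchanged.

```javascript
// Hypothetical check: true when every surrogate is part of a proper pair.
function isWellFormedUTF16(str) {
  for (var i = 0; i < str.length; i++) {
    var u = str.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // high without low
      i++; // skip the low surrogate of this pair
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return false; // low surrogate with no preceding high
    }
  }
  return true;
}

console.log(isWellFormedUTF16("a\uD800\uDC00b")); // true
console.log(isWellFormedUTF16("a\uDC00\uD800b")); // false
console.log(isWellFormedUTF16("\uD800"));         // false
```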

-Shawn


Re: Full Unicode strings strawman

2011-05-17 Thread Allen Wirfs-Brock

On May 17, 2011, at 12:00 PM, Phillips, Addison wrote:

 Note: The W3C Internationalization Core WG published a set of requirements 
 in this area for consideration by ES some time ago. It lives here:
 
   http://www.w3.org/International/wiki/JavaScriptInternationalization 

You might want to formally convey these requests to TC39 via the W3C/Ecma 
liaison process. That would carry much more weight and visibility.  I don't 
believe this document has shown up on any TC39 agenda or has been incorporated 
into any of our planning.

Allen


Re: arrow syntax unnecessary and the idea that function is too long

2011-05-17 Thread Peter Michaux
On Tue, May 17, 2011 at 10:50 AM, Axel Rauschmayer a...@rauschma.de wrote:
 On May 17, 2011, at 4:57, Peter Michaux petermich...@gmail.com wrote:

 The goal of pleasing compiler writers should be
 to make it possible to compile existing languages like Perl, Ruby,
 Python and Scheme to JavaScript. These languages are the ones that
 people know and really want to use and target their compilers to
 JavaScript.

 You sound like you really hate JavaScript and can’t imagine working with it
 unless some other language is compiled to it.

Actually the opposite is true. I write in JavaScript all day and like
it a lot. I wouldn't want to compile to JavaScript with today's
possibilities.

What I was trying to express is that I believe the dream of people who
want to compile to JavaScript is to write in their server-side
language of choice (e.g. Perl, Python, Ruby, Scheme, Java, etc) and
compile that to JavaScript.

Peter


RE: Full Unicode strings strawman

2011-05-17 Thread Phillips, Addison
We did. 

Cf. http://lists.w3.org/Archives/Public/public-i18n-core/2009OctDec/0102.html 

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.






Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 14:39, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 2:12 PM, Wes Garland wrote:

 That said, you can encode these code points with utf-8; for example,
 0xdc08 becomes 0xed 0xb0 0x88.


 By the same argument, you can encode them in UTF-16.  The byte sequence
 above is not valid UTF-8.  See "How do I convert an unpaired UTF-16
 surrogate to UTF-8?" at http://unicode.org/faq/utf_bom.html which says:


You are comparing apples and oranges. Which happen to look a lot alike. So
maybe apples and nectarines.

But the point remains, the FAQ entry you quote talks about encoding a lone
surrogate, i.e. a code unit, which is not a complete code point. You can
only convert complete code points from one encoding to another. Just like
you can't represent part of a UTF-8 code sub-sequence in any other encoding.
The fact that code point X is not representable in UTF-16 has no bearing on
its status as a code point, nor its convertability to UTF-8.  The problem is
that UTF-16 cannot represent all possible code points.


 See above.  You're allowing number arrays that may or may not be
 interpretable as Unicode strings, period.


No, I'm not.  Any sequence of Unicode code points is a valid Unicode string.
It does not matter whether any of those code points are reserved, nor does
it matter if it can be represented in all encodings.

From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

 *D80 Unicode string:* A code unit sequence containing code units of a
 particular Unicode encoding form.
 • In the rawest form, Unicode strings may be implemented simply as arrays
   of the appropriate integral data type, consisting of a sequence of code
   units lined up one immediately after the other.
 • A single Unicode string must contain only code units from a single
   Unicode encoding form. It is not permissible to mix forms within a string.



Not sure what (D80) is supposed to mean.


Sorry, (D80) means "per definition D80 of The Unicode Standard, Version
6.0".



 This hypothesis is worth testing before being blindly inflicted on the web.


I don't think anybody in this discussion is talking about blindly inflicting
anything on the web.  I *do* think this proposal is a good one, and
certainly a better way forward than insisting that every JS developer,
everywhere, understand and implement (over and over again) the details of
encoding Unicode as UTF-16. Allen's point about URI escaping is right on
target here.



  If we redefine JS Strings to be arrays of Unicode code points, then the
 JS application can use JS Strings as arrays of uint21 -- but round-tripping
 the high-surrogate code points through a UTF-16 layer would not work.


 OK, that seems like a breaking change.


Yes, I believe it would be, certainly if done naively, but I am hopeful
somebody can figure out how to overcome this.  Hopeful because I think that
fixing the JS Unicode problem is a really big deal. "What happens if the guy
types a non-BMP character?" is a question which should not have to be
answered over and over again in every code review.  And I still maintain
that 99.99% of JS developers never give it a first, let alone a second, thought.

Maybe, and maybe not.  We (Mozilla) have had some proposals to actually use
 UTF-8 throughout, including in the JS engine; it's quite possible to
 implement an API that looks like a 16-bit array on top of UTF-8 as long as
 you allow invalid UTF-8 that's needed to represent surrogates and the like.


I understand by this that in the Moz proposals, you mean that the invalid
UTF-8 sequences are actually valid UTF-8 Strings which encode code points in
the range 0xd800-0xdfff, and that these code points were translated directly
(and purposefully incorrectly) as UTF-16 code units when viewed as 16-bit
arrays.

If JS Strings were arrays of Unicode code points, this conversion would be a
non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point 0xdc08,
with no incorrect conversion taking place.  The only problem is if there is
an intermediate component somewhere that insists on using UTF-16; at that
point we just can't represent code point 0xdc08 at all.  But that code point
will never appear in text; it will only appear for users using the String to
store arbitrary data, and their need has already been met.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 15:00, Phillips, Addison addi...@lab126.com wrote:

 2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is
 ill-formed, but there are too many cases in which one might wish to have
 such broken strings for scripting purposes.
 3. We should have escape syntax for supplementary characters (such as
 \U001). Looking up the surrogate pair for a given Unicode character is
 extremely inconvenient and is not self-documenting.

...

 As Shawn notes, basically, there are three ways that one might wish to
 access strings:

...
- as code units (encoding units of text)

I don't understand why (except that it is there by an accident of history)
it is desirable to expose a particular low-level detail about one
possible encoding for Unicode characters to end-user programmers.

Your point about database storage only holds if the database happens to
store Unicode strings encoded in UTF-16. It could just as easily use UTF-8,
UTF-7, or UTF-32. For that matter, the database input routine could filter
all characters not in ISO-Latin-1 and store only the lower half of
non-surrogate-pair UTF-16 code units.
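
To put numbers on that, here is a sketch that computes the storage cost of one string in three encoding forms. The function name is hypothetical and the input is assumed to be well-formed UTF-16 (every high surrogate is followed by a low one).

```javascript
// Hypothetical helper: byte cost of a well-formed string in three encodings.
function encodedSizes(str) {
  var utf8 = 0, utf16 = 0, utf32 = 0;
  for (var i = 0; i < str.length; i++) {
    var u = str.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      // Assumes a low surrogate follows; together they are one code point.
      utf8 += 4; utf16 += 4; utf32 += 4;
      i++;
    } else {
      utf8 += u < 0x80 ? 1 : u < 0x800 ? 2 : 3;
      utf16 += 2;
      utf32 += 4;
    }
  }
  return { utf8: utf8, utf16: utf16, utf32: utf32 };
}

var sizes = encodedSizes("a\u00E9\u20AC\uD835\uDC9C");
console.log(sizes.utf8, sizes.utf16, sizes.utf32); // 10 10 16
```

The string has a length of 5 in today's JS, which matches none of the three byte counts and only corresponds to the UTF-16 figure after multiplying by two.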

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 3:29 PM, Wes Garland wrote:

But the point remains, the FAQ entry you quote talks about encoding a
lone surrogate, i.e. a code unit, which is not a complete code point.
You can only convert complete code points from one encoding to another.
Just like you can't represent part of a UTF-8 code sub-sequence in any
other encoding. The fact that code point X is not representable in
UTF-16 has no bearing on its status as a code point, nor its
convertability to UTF-8.  The problem is that UTF-16 cannot represent
all possible code points.


My point is that neither can UTF-8.  Can you name an encoding that _can_ 
represent the surrogate-range codepoints?



 From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

/D80 Unicode string:/ A code unit sequence containing code units of
a particular Unicode encoding form.
• In the rawest form, Unicode strings may be implemented simply as
  arrays of the appropriate integral data type, consisting of a sequence
  of code units lined up one immediately after the other.
• A single Unicode string must contain only code units from a single
  Unicode encoding form. It is not permissible to mix forms within a string.



Not sure what (D80) is supposed to mean.


Sorry, (D80) means per definition D80 of The Unicode Standard,
Version 6.0


Ah, ok.  So the problem there is that this definition only makes 
sense when a particular Unicode encoding form has been chosen.  Which 
Unicode encoding form have we chosen here?


But note also that D76 in that same document says:

  Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.

and D79 says:

  A Unicode encoding form assigns each Unicode scalar value to a unique
  code unit sequence.

and

  To ensure that the mapping for a Unicode encoding form is
  one-to-one, all Unicode scalar values, including those
  corresponding to noncharacter code points and unassigned code
  points, must be mapped to unique code unit sequences. Note that
  this requirement does not extend to high-surrogate and
  low-surrogate code points, which are excluded by definition from
  the set of Unicode scalar values.

In particular, this makes it clear (to me, at least) that whatever 
Unicode encoding form you choose, a Unicode string can only consist of 
code units encoding Unicode scalar values, which does NOT include high 
and low surrogates.


Therefore I stand by my statement: if you allow what to me looks like 
arrays of UTF-32 code units and also values that fall into the surrogate 
ranges then you don't get Unicode strings.  You get a set of arrays 
that contains Unicode strings as a proper subset.



OK, that seems like a breaking change.

Yes, I believe it would be, certainly if done naively, but I am hopeful
somebody can figure out how to overcome this.


As long as we worry about that _before_ enshrining the result in a spec, 
I'm all for being hopeful.



Maybe, and maybe not.  We (Mozilla) have had some proposals to
actually use UTF-8 throughout, including in the JS engine; it's
quite possible to implement an API that looks like a 16-bit array on
top of UTF-8 as long as you allow invalid UTF-8 that's needed to
represent surrogates and the like.


I understand by this that in the Moz proposals, you mean that the
invalid UTF-8 sequences are actually valid UTF-8 Strings which encode
code points in the range 0xd800-0xdfff


There are no such valid UTF-8 strings; see spec quotes above.  The 
proposal would have involved having invalid pseudo-UTF-ish strings.



and that these code points were
translated directly (and purposefully incorrectly) as UTF-16 code units
when viewed as 16-bit arrays.


Yep.


If JS Strings were arrays of Unicode code points, this conversion would
be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
0xdc08, with no incorrect conversion taking place.


Sorry, no.  See above.


The only problem is
if there is an intermediate component somewhere that insists on using
UTF-16..at that point we just can't represent code point 0xdc08 at all.


I just don't get it.  You can stick the invalid 16-bit value 0xdc08 into 
a UTF-16 string just as easily as you can stick the invalid 24-bit 
sequence 0xed 0xb0 0x88 into a UTF-8 string.  Can you please, please 
tell me what made you decide there's _any_ difference between the two 
cases?  They're equally invalid in _exactly_ the same way.
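
The symmetry argued for above can be sketched with the WHATWG Encoding API (TextDecoder and TextEncoder), which is available in current browsers and Node.js. That API postdates this thread and is used here only as an illustration; the helper names are hypothetical.

```javascript
// A strict UTF-8 decoder rejects the three-byte sequence ED B0 88, which
// would otherwise decode to the surrogate code point U+DC08.
function decodesAsStrictUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false; // decoding error: the byte sequence is ill-formed
  }
}

// A 16-bit string can hold the lone code unit 0xDC08, but it is not
// well-formed UTF-16, so the encoder substitutes U+FFFD (EF BF BD) for it.
function encodeToUtf8(str) {
  return Array.from(new TextEncoder().encode(str));
}

var surrogateBytesOk = decodesAsStrictUtf8([0xed, 0xb0, 0x88]);
var loneSurrogateBytes = encodeToUtf8("\udc08");
console.log(surrogateBytesOk, loneSurrogateBytes);
```

In both directions the surrogate value is treated as an error at the encoding boundary, whichever encoding form it arrives in.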


-Boris


Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis ☕
The wrong conclusion is being drawn. I can say definitively that for the
string "a\uD800b":

   - It is a valid Unicode string, according to the Unicode Standard.
   - It cannot be encoded as well-formed in any UTF-x (it is not
   'well-formed' in any UTF).
   - When it comes to conversion, the bad code unit \uD800 needs to be
   handled (e.g. converted to U+FFFD, escaped, etc.)

Any programming language using Unicode has the choice of either

   1. allowing strings to be general Unicode strings, or
   2. guaranteeing that they are always well-formed.

There are trade-offs either way, but both are feasible.
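
As a rough illustration of the conversion step mentioned above, the following sketch replaces unpaired surrogate code units with U+FFFD. It assumes plain ES5 string operations only; replaceLoneSurrogates is a hypothetical helper name, and substitution is just one of the handling strategies listed.

```javascript
// Replace every unpaired surrogate code unit with U+FFFD, leaving valid
// surrogate pairs (and all other code units) untouched.
function replaceLoneSurrogates(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var isHigh = unit >= 0xd800 && unit <= 0xdbff;
    var isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += str.charAt(i) + str.charAt(i + 1); // valid pair, keep both units
        i++;
        continue;
      }
    }
    out += isHigh || isLow ? "\ufffd" : str.charAt(i);
  }
  return out;
}

var general = "a\ud800b"; // a general Unicode 16-bit string, not well-formed
var repaired = replaceLoneSurrogates(general);
console.log(repaired === "a\ufffdb");
```

A language taking option 1 would accept `general` as a string value and apply something like this only at conversion boundaries; a language taking option 2 would have to reject or repair it on creation.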

Mark

*— Il meglio è l’inimico del bene —*



Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 16:03, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 3:29 PM, Wes Garland wrote:

 The problem is that UTF-16 cannot represent
 all possible code points.


 My point is that neither can UTF-8.  Can you name an encoding that _can_
 represent the surrogate-range codepoints?


UTF-8 and UTF-32.  I think UTF-7 can, too, but it is not a standard so it's
not really worth discussing.  UTF-16 is the odd one out.

Therefore I stand by my statement: if you allow what to me looks like arrays
 of UTF-32 code units and also values that fall into the surrogate ranges then
 you don't get Unicode strings.  You get a set of arrays that contains
 Unicode strings as a proper subset.


Okay, I think we have to agree to disagree here. I believe my reading of the
spec is correct.


 There are no such valid UTF-8 strings; see spec quotes above.  The proposal
 would have involved having invalid pseudo-UTF-ish strings.


Yes, you can encode code points d800 - dfff in UTF-8 Strings.  These are not
*well-formed* strings, but they are Unicode 8-bit Strings (D81) nonetheless.
What you can't do is encode 16-bit code units in UTF-8 Strings. This is
because you can only convert from one encoding to another via code points.
Code units have no cross-encoding meaning.

Further, you can't encode code points d800 - dfff in UTF-16 Strings, leaving
you at a loss when you want to store those values in JS Strings (i.e. when
using them as uint16[]) except to generate ill-formed UTF-16. I believe it
would be far better to treat those values as Unicode code points, not 16-bit
code units, and to allow JS String elements to be able to express the whole
21-bit code point range afforded by Unicode.

In other words, current misuse of JS Strings which can store characters
0-FFFF in ill-formed UTF-16 strings would become use of JS Strings to store
code points 0-1FFFFF which may use reserved code points d800-dfff, the high
surrogates, which cannot be represented in UTF-16. But CAN be represented,
without loss, in UTF-8, UTF-32, and proposed-new-JS-Strings.


  If JS Strings were arrays of Unicode code points, this conversion would
 be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
 0xdc08, with no incorrect conversion taking place.


 Sorry, no.  See above.


# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003

I just don't get it.  You can stick the invalid 16-bit value 0xdc08 into a
 UTF-16 string just as easily as you can stick the invalid 24-bit sequence
 0xed 0xb0 0x88 into a UTF-8 string.  Can you please, please tell me what
 made you decide there's _any_ difference between the two cases?  They're
 equally invalid in _exactly_ the same way.


The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
point 0xdc08", and in UTF-16, 0xdc08 means "Part of some non-BMP code point".

Said another way, 0xed in UTF-8 has nearly the same meaning as 0xdc08 in
UTF-16.  Both are ill-formed code unit subsequences which do not represent a
code unit (D84a).

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky

On 5/17/11 5:24 PM, Wes Garland wrote:

UTF-8 and UTF-32.  I think UTF-7 can, too, but it is not a standard so
it's not really worth discussing.  UTF-16 is the odd one out.


That's not what the spec says.


Okay, I think we have to agree to disagree here. I believe my reading of
the spec is correct.


Sorry, but no...  how much more clear can the spec get?


There are no such valid UTF-8 strings; see spec quotes above.  The
proposal would have involved having invalid pseudo-UTF-ish strings.


Yes, you can encode code points d800 - dfff in UTF-8 Strings.  These are
not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
nonetheless.


The spec seems to pretty clearly define UTF-8 strings as things that do 
NOT contain the encoding of those code points.  If you think otherwise, 
cite please.



Further, you can't encode code points d800 - dfff in UTF-16 Strings,


Where does the spec say this?  And why does that part of the spec not 
apply to UTF-8?



# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003


As far as I can tell, that second conversion is just an implementation 
bug per the spec.  See the part I quoted which explicitly says that an 
encoder in that situation must stop and return an error.



The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
point 0xdc08"


According to the spec you were citing, that code unit sequence means a 
UTF-8 decoder should error, no?
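
The disagreement over the iconv output can be made concrete with a hand-written encoder. This is a minimal sketch under stated assumptions: encodeCodePoint and its strict flag are hypothetical, and the lenient branch simply applies the three-byte bit pattern mechanically, which is consistent with the bytes the iconv run above produced.

```javascript
// Encode one code point as UTF-8 bytes. With strict = true the encoder
// refuses surrogate code points (only Unicode scalar values are encodable);
// with strict = false it applies the bit pattern without that check.
function encodeCodePoint(cp, strict) {
  if (cp < 0 || cp > 0x10ffff) throw new RangeError("not a code point");
  if (strict && cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogate code points are not Unicode scalar values");
  }
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f)
  ];
}

var lenientBytes = encodeCodePoint(0xdc08, false); // ED B0 88
var strictFailed = false;
try {
  encodeCodePoint(0xdc08, true);
} catch (e) {
  strictFailed = true; // the behaviour Boris reads the spec as requiring
}
console.log(lenientBytes, strictFailed);
```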


-Boris


Re: arrow syntax unnecessary and the idea that function is too long

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 5:04 PM, Kyle Simpson wrote:

 Regarding the -> and => syntax, I just want to throw out one other concern 
 that I hope is taken into account, not only now, but for the future: I really 
 hope that we don't get to the point where we start adding functionality to 
 that style of function that is not available to explicit functions (we're 
 almost, but not, there with having => do the magical `this` binding).

You have to distinguish syntax from semantics.

There's nothing proposed for arrow functions that is more than shorter syntax 
-- including |this| binding.
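
The claim that |this| binding is only shorter syntax can be sketched as follows. The `=>` form shown is the arrow syntax that later shipped in ECMAScript 2015, which may differ in detail from the strawman under discussion; Counter and its method names are hypothetical.

```javascript
function Counter() {
  this.count = 0;
}

// Long form: the callback needs an explicit bind to see the outer `this`.
Counter.prototype.incrementerLong = function () {
  return function () {
    this.count++;
    return this.count;
  }.bind(this);
};

// Arrow form: `this` inside the arrow is the `this` of the enclosing method.
Counter.prototype.incrementerArrow = function () {
  return () => {
    this.count++;
    return this.count;
  };
};

var counter = new Counter();
var afterLong = counter.incrementerLong()();
var afterArrow = counter.incrementerArrow()();
console.log(afterLong, afterArrow);
```

Both callbacks update the same object, so the arrow form adds no capability that the long form with bind lacks.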


 I know Brendan and others have declared it's shorthand only, but it can be 
 a slippery slope, and to rely on the "if you don't like ->, don't use it" 
 argument, we have to make sure that it really stays only a shorthand and 
 nothing more, otherwise it's tail-wagging-the-dog.

Agreed, which is why I'm still going to write up a fairly radical Ruby-block 
proposal that competes in this sense: it too gives better function syntax, 
plus semantics not available to arrow functions as constrained to be just 
syntax.

I hope we'll be able to decide between these two approaches quickly, since I do 
not want to do both. That is, either arrow functions win and shorter syntax 
is enough; or we have blocks for control abstraction (which means other syntax 
changes, details soon) and no arrow functions.


 In other words, I hope that those who favor -> aren't also hoping that 
 eventually -> replaces `function` entirely. As stated many times thus far in 
 this thread, there are still those of us who favor (and maybe always will) 
 the explicitness of `function(){}` or `#(){}`.

There's no way to remove 'function' long syntax from JS. Just no way.

Your point about just syntax is well-taken, since translation tools will be 
important in aiding Harmony migration -- not just for targeting downrev 
browsers but for added static checking -- and these should be as simple (local 
rewriting, e.g. transpilers not compilers) as possible.

/be



Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 5:24 PM, Wes Garland wrote:

 Okay, I think we have to agree to disagree here. I believe my reading of
 the spec is correct.


 Sorry, but no...  how much more clear can the spec get?


In the past, I have read it thus, pseudo BNF:

UnicodeString = CodeUnitSequence // D80
CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
CodeUnit = anything in the current encoding form // D77

Upon careful re-reading of this part of the specification, I see that D79 is
also important.  It says that "A Unicode encoding form assigns each Unicode
scalar value to a unique code unit sequence.", and further clarifies that
"The mapping of the set of Unicode scalar values to the set of code unit
sequences for a Unicode encoding form is one-to-one."

This means that your original assertion -- that Unicode strings cannot
contain the high surrogate code points, regardless of meaning -- is in fact
correct.

Which is unfortunate, as it means that we either

   1. Allow non-Unicode strings in JS -- i.e. Strings composed of all values
   in the set [0x0, 0x1FFFFF]
   2. Keep making programmers pay the raw-UTF-16 representation tax
   3. Break the String-as-uint16 pattern

I still believe that #1 is the way forward, and that problem of
round-tripping these values through the DOM is solvable.
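
The "raw-UTF-16 representation tax" in option 2 can be illustrated with plain ES5. In this sketch, codePoints is a hypothetical helper showing the surrogate pairing that programmers currently have to write by hand to get a code-point view of a string.

```javascript
var clef = "\ud834\udd1e"; // U+1D11E MUSICAL SYMBOL G CLEF, one code point

// Code-unit view (what ES5 exposes): two elements, each half a character.
var unitCount = clef.length;
var firstUnit = clef.charCodeAt(0);

// Code-point view: pair up surrogates by hand.
function codePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < str.length) {
      var lo = str.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        result.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i++;
        continue;
      }
    }
    result.push(hi); // BMP character or unpaired surrogate, kept as-is
  }
  return result;
}

var points = codePoints("a" + clef);
console.log(unitCount, firstUnit.toString(16), points);
```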

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis ☕
That is incorrect. See below.

Mark

*— Il meglio è l’inimico del bene —*


On Tue, May 17, 2011 at 18:33, Wes Garland w...@page.ca wrote:

 On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote:

 On 5/17/11 5:24 PM, Wes Garland wrote:

 Okay, I think we have to agree to disagree here. I believe my reading of
 the spec is correct.


 Sorry, but no...  how much more clear can the spec get?


 In the past, I have read it thus, pseudo BNF:

 UnicodeString = CodeUnitSequence // D80
 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
 CodeUnit = anything in the current encoding form // D77


So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.



 Upon careful re-reading of this part of the specification, I see that D79
 is also important.  It says that "A Unicode encoding form assigns each
 Unicode scalar value to a unique code unit sequence.",


True.


 and further clarifies that "The mapping of the set of Unicode scalar values
 to the set of code unit sequences for a Unicode encoding form is
 one-to-one."


True.

This is all consistent with saying that UTF-16 can't contain an isolated
d800.

*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 String (D89), which has been pointed out previously.*

Repeating the note under D89:

A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid
UTF-16 string*, or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.

Examples:

   - \u0061\ud800\udc00 is both a Unicode 16-bit string and a UTF-16
   string.
   - \u0061\ud800\u0062 is a Unicode 16-bit string, but not a UTF-16
   string.
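
The distinction between a Unicode 16-bit string and a UTF-16 string can be expressed as a predicate. A minimal sketch, assuming ES5 strings as the 16-bit code unit sequences; isWellFormedUtf16 is a hypothetical name.

```javascript
// Every JS string is a Unicode 16-bit string (any sequence of 16-bit code
// units). It is additionally a UTF-16 string only if each high surrogate is
// immediately followed by a low surrogate and no low surrogate stands alone.
function isWellFormedUtf16(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false; // unpaired high surrogate
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

var paired = isWellFormedUtf16("\u0061\ud800\udc00");
var unpaired = isWellFormedUtf16("\u0061\ud800\u0062");
console.log(paired, unpaired);
```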



 This means that your original assertion -- that Unicode strings cannot
 contain the high surrogate code points, regardless of meaning -- is in fact
 correct.


That is incorrect.





Private Names in 'text/javascript'

2011-05-17 Thread Luke Hoban
The Private Names strawman currently combines a new runtime capability (using 
both strings and private names as keys in objects) with several new syntactic 
constructs (private binding declarations, #.id).  At the March meeting, I 
recall there was some support for the idea of separating these two aspects, and 
exposing the runtime capability also as a library that could be used in 
'text/javascript'.

I added a comment to the Private Names strawman page to suggest how this could 
be done.  The runtime behavior of the proposal is the same, but in addition, a  
library function Object.createPrivateName(name) is added which provides direct 
access to the internal CreatePrivateName abstract operation.  This allows the 
use of private names in a more verbose form, but without needing new syntax - 
similar in spirit to the ES5 Object.* operations.
Borrowing an example from the current proposal to illustrate:

Using 'text/harmony' syntax:
  function Point(x,y) {
    private x, y;
    this.x = x;
    this.y = y;
    //... methods that use private x and y properties
  }
  var pt = new Point(1,2);

Using 'text/javascript' syntax:
  function Point(x,y) {
    var _x = Object.createPrivateName("x");
    var _y = Object.createPrivateName("y");
    this[_x] = x;
    this[_y] = y;
    //... methods that use private _x and _y properties
  }
  var pt = new Point(1,2);

There seem to be several benefits to this:
(1) The private name capability can be made available to 'text/javascript'
(2) The feature is easily feature-detectable, with a fallback of using 
'_'-prefixed or similar pseudo-private conventions
(3) The core functionality can potentially be agreed upon and implemented in 
engines earlier than full new syntax
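
Benefit (2) might look like the sketch below. Object.createPrivateName is the library function proposed in this message and is assumed here, not a shipped API; createName, xName, yName and norm1 are hypothetical names. Unlike the example above, the names are created once outside the constructor so that prototype methods can share them.

```javascript
// Use the proposed function when the engine provides it; otherwise fall
// back to the '_'-prefixed pseudo-private convention.
var createName = typeof Object.createPrivateName === "function"
  ? Object.createPrivateName
  : function (name) { return "_" + name; };

var xName = createName("x");
var yName = createName("y");

function Point(x, y) {
  this[xName] = x;
  this[yName] = y;
}

// A method that reads the (pseudo-)private properties through the names.
Point.prototype.norm1 = function () {
  return Math.abs(this[xName]) + Math.abs(this[yName]);
};

var pt = new Point(1, -2);
console.log(pt.norm1());
```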

Luke



Re: Private Names in 'text/javascript'

2011-05-17 Thread David Herman
Yes, I agree that separating them out is a good idea. Allen and I have been 
working on this lately, and I've signed up to present private names at the 
upcoming face-to-face. Our thinking has been along similar lines to what you 
describe here.

Dave



RE: Private Names in 'text/javascript'

2011-05-17 Thread Luke Hoban

  Yes, I agree that separating them out is a good idea. Allen and I have 
been working on this lately, and I've signed up to present private names at the 
upcoming face-to-face. Our thinking has been along similar lines to what you 
describe here.

  Dave

Great - I see the new unique_string_values strawman now.  That looks like it 
does address the same goal.  Happy to see there is already progress on this.  
Was there a particular reason for the shift to treating these names as a new 
kind of string value instead of as a separate object kind which could be used 
as a key in objects?

Luke



Re: Private Names in 'text/javascript'

2011-05-17 Thread Allen Wirfs-Brock
Yes (from my perspective) but it is something we are still hashing out so don't 
assume that will be the final proposal.

Allen


prototype for operator proposal for review

2011-05-17 Thread Allen Wirfs-Brock
We had so much fun with feedback on my Unicode proposal I just have to open 
another one up for list feedback:

An updated version of the prototype for (formerly proto) operator proposal is 
at http://wiki.ecmascript.org/doku.php?id=strawman:proto_operator 

Allen


Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
Mark;

Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder
of the Unicode <http://en.wikipedia.org/wiki/Unicode> project and the
president of the Unicode Consortium
<http://en.wikipedia.org/wiki/Unicode_Consortium> since its
incorporation in 1991?

(If so, uh, thanks for giving me alternatives to Shift-JIS, GB-2312, Big-5,
et al..those gave me lots of hair loss in the late 90s)

On 17 May 2011 21:55, Mark Davis ☕ m...@macchiato.com wrote:

 In the past, I have read it thus, pseudo BNF:


 UnicodeString = CodeUnitSequence // D80
 CodeUnitSequence = CodeUnit | CodeUnitSequence CodeUnit // D78
 CodeUnit = anything in the current encoding form // D77


 So far, so good. In particular, d800 is a code unit for UTF-16, since it is
 a code unit that can occur in some code unit sequence in UTF-16.


*head smack* - code unit, not code point.




 This means that your original assertion -- that Unicode strings cannot
 contain the high surrogate code points, regardless of meaning -- is in fact
 correct.


 That is incorrect.


Aie, Karumba!

If we have

   - a sequence of code points
   - taking on values between 0 and 0x1FFFFF
   - including high surrogates and other reserved values
   - independent of encoding

..what exactly are we talking about?  Can it be represented in UTF-16
without round-trip loss when normalization is not performed, for the code
points 0 through 0xFFFF?
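
One way to probe the round-trip question is to encode arbitrary code point sequences to 16-bit units and decode them again. This is a sketch under stated assumptions: encodeUtf16 and decodeUtf16 are hypothetical helpers that write surrogate code points through unchanged, which is a design choice for the illustration and not something the Unicode Standard defines.

```javascript
// Encode code points to 16-bit units; surrogate code points pass through.
function encodeUtf16(codePoints) {
  var units = [];
  codePoints.forEach(function (cp) {
    if (cp >= 0x10000) {
      var v = cp - 0x10000;
      units.push(0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff));
    } else {
      units.push(cp);
    }
  });
  return units;
}

// Decode 16-bit units to code points, pairing surrogates where possible.
function decodeUtf16(units) {
  var cps = [];
  for (var i = 0; i < units.length; i++) {
    var u = units[i];
    var next = units[i + 1];
    if (u >= 0xd800 && u <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      cps.push(0x10000 + ((u - 0xd800) << 10) + (next - 0xdc00));
      i++;
    } else {
      cps.push(u);
    }
  }
  return cps;
}

function roundTrip(codePoints) {
  return decodeUtf16(encodeUtf16(codePoints));
}

var isolated = roundTrip([0x61, 0xd800, 0x62]); // survives unchanged
var adjacent = roundTrip([0xd800, 0xdc00]);     // collapses to [0x10000]
console.log(isolated, adjacent);
```

Under these assumptions an isolated surrogate code point survives the trip, but a high surrogate code point immediately followed by a low surrogate code point is indistinguishable from a supplementary character once encoded, so that case is lossy.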

Incidentally, I think this discussion underscores nicely why I think we
should work hard to figure out a way to hide UTF-16 encoding details from
user-end programmers.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: I noted some open issues on Classes with Trait Composition

2011-05-17 Thread Mark S. Miller
On Sun, May 15, 2011 at 10:01 PM, Brendan Eich bren...@mozilla.com wrote:


 http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#open_issues


That wiki page has now had extensive revisions in light of recent discussions
with Brendan, Allen, Dave Herman, and Bob Nystrom. It derives from previous
discussions with Allen, Bob, and Peter Hallam. I have tried to capture here
as best as I could the consensus that has emerged from these discussions.

All this derives from earlier discussions that also included Waldemar, Alex
Russell, Arv, and Tom Van Cutsem. And the experience of the Traceur project
made a significant contribution. This has all had a long history so if I've
left out some key contributors, please let me know, thanks.



 This looks pretty good at a glance, but it's a *lot*, and it's new.


It's much less now! The main effect of all the recent feedback was to find
opportunities to remove things. What remains is mostly just a way to express
the familiar pattern by which JavaScript programmers manually express
class-like semantics using prototypes. The result interoperates in both
directions with such old code: a class can inherit from a traditional
constructor function and vice versa.



 I have to say this reminds me of ES4 classes. That's neither bad nor good,
 but it's not just superficial, as far as I can tell (and I was reading specs
 then and now).


It definitely had an influence. There were many things I liked about ES4
classes.



 On the other hand, I'm in no rush to standardize something this complex and
 yet newly strawman-spec'ed and yet unimplemented. So we may as well take our
 time, learn from history, and go around the karmic wheel again for another
 few years...

 I'm not against classes as a near-term objective, but in order to
 *be* near-term and not to unwind in committee, I believe they have to be dead
 simple and prototypal, with very few knobs, bells and whistles.


I am indeed proposing this as a near term objective. The usual caveats
apply: we are asking the committee to approve the general shape presented by
this strawman, with syntactic and semantic refinements expected to continue,
for this as for all other proposals, after May.

Brendan, with all the simplifications since you posted this email, in your
opinion, have we achieved the level of simplicity needed?



 Factoring out privacy


Done.


 and leaving constructor in charge of per-instance property setting, as it
 is in ES5,


Done.


 would IMHO help.


Hope so ;).


I do understand that this page may be hard to appreciate without motivation
and examples. I'm hoping these are coming soon.

-- 
Cheers,
--MarkM


Re: I noted some open issues on Classes with Trait Composition

2011-05-17 Thread Mark S. Miller
On Sun, May 15, 2011 at 11:49 PM, Brendan Eich bren...@mozilla.com wrote:

 On May 15, 2011, at 10:01 PM, Brendan Eich wrote:


 http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#open_issues

 This looks pretty good at a glance, but it's a *lot*, and it's new.


 Looking closer, I have to say something non-nit-picky that looks bad and
 smells like committee:


 http://wiki.ecmascript.org/doku.php?id=strawman:classes_with_trait_composition#inheritance

 Two kinds of inheritance, depending on the dynamic type of the result of
 evaluating the //MemberExpression// on the right of ''extends''? That will
 be confusing.


This smell is actually just my fault; it did not derive from ideas arrived
at in meetings. In any case, it is gone. super(x, y); is now always simply
equivalent to Superclass.call(this, x, y);, but as if using the original
rather than the current binding of Function.prototype.call.
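
A minimal sketch of that equivalence in ES5 terms, not taken from the strawman itself. Point and ColorPoint are hypothetical constructors, and callOriginal is an illustrative helper that captures the original Function.prototype.call up front so that later rebinding of it has no effect:

```javascript
// Capture the original Function.prototype.call before any user code can
// replace it. callOriginal(fn, thisArg, a, b) behaves like fn.call(thisArg, a, b).
var callOriginal = Function.prototype.call.bind(Function.prototype.call);

// Hypothetical superclass, for illustration only.
function Point(x, y) {
  this.x = x;
  this.y = y;
}
Point.prototype.norm = function () {
  return Math.sqrt(this.x * this.x + this.y * this.y);
};

// Hypothetical subclass written by hand in ES5.
function ColorPoint(x, y, color) {
  // What `super(x, y);` is described as meaning:
  // Point.call(this, x, y), using the original binding of call.
  callOriginal(Point, this, x, y);
  this.color = color;
}

// Traditional prototypal inheritance: one linear prototype chain.
ColorPoint.prototype = Object.create(Point.prototype, {
  constructor: { value: ColorPoint, writable: true, configurable: true }
});

var cp = new ColorPoint(3, 4, "red");
```

Here cp picks up x and y from the Point initializer and norm from Point.prototype through the prototype chain.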




 Is the traits-composition way really needed in this proposal? If so, then
 please consider not abusing ''extends'' to mean ''compose'' depending on
 dynamic type of result of expression to its right.


All dependencies on traits have been separated into a separate strawman,
extending this one, but not to be proposed until after ES-next. The only
inheritance in this one is traditional JS prototypal inheritance.

-- 
Cheers,
--MarkM


Re: I noted some open issues on Classes with Trait Composition

2011-05-17 Thread Mark S. Miller
On Mon, May 16, 2011 at 4:54 AM, Dmitry A. Soshnikov 
dmitry.soshni...@gmail.com wrote:
[...]

 Some simple examples of all use-cases are needed, I think.


Absolutely agree. I hope they are coming soon. Watch this space ;).




 Regarding the `new` keyword for the constructor (aka initializer), after all,
 it also may be OK. E.g. Ruby uses `new` as exactly the method of a class --
 Array.new, Object.new, etc. Though `constructor` is also good, yeah.


The history here is interesting. An earlier unreleased version of the
Traceur compiler used `constructor`. When we saw Allen's use of `new` in one
of the object-literal-based class proposals, it seemed like a good idea so
we switched to that. In light of Brendan's criticism, we realized we should
return to `constructor` -- it's an elegant pun.



 Regarding two inheritance types, I think better to make nevertheless one
 inheritance type -- linear (by prototype chain).


Done.



 And to make additionally small reusable code units -- mixins or traits --
 no matter. Thus, of course if they will also be delegation-based and not
 just copy-own-properties, then we automatically get a sort of multiple
 inheritance.


Gone. Or rather, postponed into a strawman that will not be proposed till
after ES-next.

[...]

-- 
Cheers,
--MarkM


RE: prototype for operator proposal for review

2011-05-17 Thread Luke Hoban
If there were a more usable library variant of Object.create instead, it seems 
the new syntax here would not be as necessary.

Instead of:
  var o = myProto <| {
   a: 0,
   b: function () {}
  }

You could do:
  var o = Object.make(myProto, {
   a: 0,
   b: function () {}
  })

A few more characters, but still addresses the major issue preventing wider 
Object.create usage (the use of property descriptors).  A library solution also 
keeps the benefit of not needing new syntax, and being available to 
text/javascript.  As noted in the strawman, similar functions on Array and 
Function could support the other scenarios described in the proposal.
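
A rough sketch of how such a library function might be written on top of ES5's Object.create. The name make and its behavior are assumptions for illustration only; Object.make is not a standard function, and this is not taken from the strawman:

```javascript
// Hypothetical helper: builds an object with the given prototype and copies
// the own properties of `props` onto it, so callers can pass a plain object
// literal instead of a map of property descriptors.
function make(proto, props) {
  var descriptors = {};
  Object.getOwnPropertyNames(props).forEach(function (name) {
    // Reuse each property's existing descriptor (value, getter/setter, flags).
    descriptors[name] = Object.getOwnPropertyDescriptor(props, name);
  });
  return Object.create(proto, descriptors);
}

// Illustrative prototype; the names are made up for this example.
var myProto = {
  greet: function () { return "hi from " + this.a; }
};

var o = make(myProto, {
  a: 0,
  b: function () {}
});
```

Note that the object literal passed as the second argument is exactly the intermediate object mentioned below: it is allocated, read, and then discarded.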

It seems the syntax is perhaps aiming to avoid needing to allocate an 
intermediate object - but I imagine engines could potentially do that for 
Object.make and friends as well if it was important for performance?

Luke

From: es-discuss-boun...@mozilla.org [mailto:es-discuss-boun...@mozilla.org] On 
Behalf Of Allen Wirfs-Brock
Sent: Tuesday, May 17, 2011 7:50 PM
To: es-discuss@mozilla.org
Subject: prototype for operator proposal for review

We had so much fun with feedback on my Unicode proposal that I just have to open 
another one up for list feedback:

An updated version of the "prototype for" (formerly "proto") operator proposal is 
at http://wiki.ecmascript.org/doku.php?id=strawman:proto_operator

Allen


Re: prototype for operator proposal for review

2011-05-17 Thread Jeff Walden

On 05/17/2011 09:49 PM, Luke Hoban wrote:

It seems the syntax is perhaps aiming to avoid needing to allocate an 
intermediate object – but I imagine engines could potentially do that for 
Object.make and friends as well if it was important for performance?


It's probably possible to do that.  But such hacks are rather fragile.  I suspect this 
would take roughly the form of the way SpiderMonkey optimizes Function.prototype.apply, 
which is roughly to look for calls of properties named "apply" and do 
special-case behavior with a PIC in the case that that property is actually 
|Function.prototype.apply|.  It takes some pretty gnarly code, duplicated in two places 
(possibly a third, but that might not be necessary), to make it all happen.  That sort of 
pattern certainly can be repeated if push comes to shove.  But I believe doing so is far 
inferior to dedicated, first-class syntactical support to make the semantics absolutely 
unambiguous and un-confusable with anything else.

In this particular case, I suspect implementing a PIC that way would be even gnarlier, 
because it wouldn't just be a PIC on the identity of the |Object.make| property, it'd 
have to also apply to computation of the arguments provided in the function 
call (or a not-call if you're using a PIC this way).  That too can probably 
be done.  But it'd be pretty tricky (thinking of things like the PIC only being 
applicable if the argument is an object literal, and of it being mostly inapplicable if 
it's anything else).  And if you wanted to extend that to apply to more functions than 
just a single Object.make function, the hacks will be even more complex, possibly not 
even by a constant increment.
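
To make the identity check concrete, here is a user-level simulation of the guard such an optimization would need. This is only a sketch under stated assumptions, not how SpiderMonkey or any other engine is implemented: lib.make is a hypothetical stand-in for the proposed Object.make, and makeCallSite plays the role of one call site.

```javascript
// Hypothetical stand-in for the proposed library function (not a real API).
var lib = {
  make: function (proto, props) {
    var descriptors = {};
    Object.getOwnPropertyNames(props).forEach(function (name) {
      descriptors[name] = Object.getOwnPropertyDescriptor(props, name);
    });
    return Object.create(proto, descriptors);
  }
};

// The function the "engine" knows how to special-case.
var builtinMake = lib.make;

// Records which path each call took, for inspection.
var pathsTaken = [];

// Simulates a single call site `lib.make(proto, props)`.
function makeCallSite(proto, props) {
  var callee = lib.make; // the property is looked up on every call
  if (callee === builtinMake) {
    // Guard passed: the special-cased path is safe to take. A real engine
    // could skip the intermediate object here; this sketch just calls through.
    pathsTaken.push("fast");
    return builtinMake(proto, props);
  }
  // Guard failed: someone rebound the property, so only a generic call is valid.
  pathsTaken.push("generic");
  return callee(proto, props);
}

var a = makeCallSite({}, { x: 1 });

// User code is free to replace the property at any time.
lib.make = function () { return "user-defined"; };
var b = makeCallSite({}, { x: 1 });
```

The same source text at the call site means two different things before and after the rebinding, which is the ambiguity that dedicated syntax would rule out.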

And of course this would also make it harder for IDEs and such to give good 
first-class syntax highlighting here, because the syntax for this would be 
ambiguous with user-created stuff.

Anyway, food for thought.  And I know others here are more familiar with this 
than I am, so please chime in with more if you have it, or corrections if you 
have them.

Jeff