Re: Full Unicode based on UTF-16 proposal

On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 The conformance clause doesn't say anything about the interpretation of
 (UTF-16) code units as code points. To check conformance with C1, you have
 to look at how the resulting code points are actually further interpreted.


True, but if the proposed language

A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a
surrogate pair, is interpreted as a code point with the same value.

is adopted, won't this have the effect of creating unpaired
surrogates as code points? If so, then by my estimation, this *will* increase
the likelihood of their being interpreted as abstract characters. For example,
if the unpaired code unit is interpreted as an unpaired surrogate code
point, and some process or function performs *any* predicate or transform on
that code point, then that amounts to interpreting it as an abstract
character.

I would rather see such an unpaired code unit either (1) mapped to
U+FFFD, or (2) cause an exception to be raised when performing an operation
that requires conversion of the UTF-16 code unit sequence.
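
The three treatments under discussion can be compared side by side. The following is only a minimal sketch, not text from the proposal: the function name `decodeCodePoints` and its `policy` argument are hypothetical, introduced here for illustration.

```javascript
// Hypothetical helper: converts a string (UTF-16 code units) to an array of
// code points under one of three policies for unpaired surrogates.
//   "passthrough": the proposed wording (code point with the same value)
//   "replace":     option (1), map to U+FFFD
//   "throw":       option (2), raise an exception
function decodeCodePoints(str, policy) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const cu = str.charCodeAt(i);
    const isHigh = cu >= 0xD800 && cu <= 0xDBFF;
    const isLow = cu >= 0xDC00 && cu <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed surrogate pair: combine into one supplementary code point.
        out.push(0x10000 + ((cu - 0xD800) << 10) + (next - 0xDC00));
        i++;
        continue;
      }
    }
    if (isHigh || isLow) {
      if (policy === "replace") {
        out.push(0xFFFD);
        continue;
      }
      if (policy === "throw") {
        throw new RangeError("unpaired surrogate at index " + i);
      }
    }
    out.push(cu); // BMP code unit, or a passed-through unpaired surrogate
  }
  return out;
}
```

Under "passthrough" the string "a\uD800b" yields the code point 0xD800 in the middle; under "replace" it yields 0xFFFD; under "throw" the conversion fails.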


 My proposal interprets the resulting code points in the following ways:

 1) In regular expressions, they can be used in both patterns and input
 strings to be matched. They may be compared against other code points, or
 against character classes, some of which will hopefully soon be defined by
 Unicode properties. In the case of comparing against other code points,
 they can't match any code points assigned to abstract characters. In the
 case of Unicode properties, they'll typically fall into the large bucket of
 have-nots, along with other unassigned code points or, for example, U+FFFD,
 unless you ask for their general category.

 2) When parsing identifiers, they will not have the ID_Start or
 ID_Continue properties, so they'll be excluded, just like other unassigned
 code points or U+FFFD.

 3) In case conversion, they won't have upper case or lower case
 equivalents defined, and remain as is, as would happen for unassigned code
 points or U+FFFD.

 I don't think any of these amounts to interpretation as abstract
 characters. I mention U+FFFD because the alternative interpretation of
 unpaired surrogates would be to replace them with U+FFFD, but that doesn't
 seem to improve anything.
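
The three cases above can be observed in engines that implement the `u` flag and `\p{...}` property escapes. Those features were standardized in later ECMAScript editions, so this is a sketch of how the behaviour turned out, not of what the proposal specified at the time; all variable names are illustrative.

```javascript
const lone = "\uD800";          // an unpaired high surrogate
const pair = "\u{10000}";       // the well-formed pair D800 DC00

// 1) Regular expressions: under /u the lone surrogate is one code point that
//    equals only itself; it does not match half of a well-formed pair.
const matchesItself = /^\uD800$/u.test(lone);
const matchesInsidePair = /\uD800/u.test(pair);
const matchesLetterClass = /^[a-z\u{10000}-\u{10FFFF}]$/u.test(lone);

// 2) Identifiers: neither ID_Start nor ID_Continue applies.
const isIdStart = /\p{ID_Start}/u.test(lone);
const isIdContinue = /\p{ID_Continue}/u.test(lone);

// 3) Case conversion: no mapping is defined, so the value is left as is.
const caseUnchanged =
  lone.toUpperCase() === lone && lone.toLowerCase() === lone;
```

Only `matchesItself` and `caseUnchanged` come out true; every other test is false.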

 Norbert



 On Mar 26, 2012, at 15:10 , Glenn Adams wrote:

  On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough 
 barraclo...@apple.com wrote:
  I really like the direction you're going in, but have one minor concern
 relating to regular expressions.
 
  In your proposal, you currently state:
 A code unit that is in the range 0xD800 to 0xDFFF, but is not
 part of a surrogate pair, is interpreted as a code point with the same
 value.
 
  Just as a reminder, this would be in explicit violation of the Unicode
 conformance clause C1 unless it can be guaranteed that such a code point
 will not be interpreted as an abstract character:
 
  C1: A process shall not interpret a high-surrogate code point or a
 low-surrogate code point as an abstract character.
 
  [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
 
  Given that such a guarantee is likely impractical, this presents a problem
 for the above proposed language.


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

Re: Full Unicode based on UTF-16 proposal

2012-03-27 Thread Steven Levithan


Erik Corry wrote:

Steven Levithan wrote:

- Make \d\w\b Unicode-aware.


I think we should leave these alone.  They are concise and useful and
will continue to be so when /u is the default in Harmony code.
Instead we should introduce \p{...} immediately which provides the
same functionality.


\w and \b are broken without Unicode. ASCII \d is concise and useful, but so 
is [0-9]. Unicode-aware \b can't be emulated using \p{..} unless lookbehind 
is also added (which is tentatively approved for ES6 but could get delayed). 
Unicode-aware \w\b\d are required by UTS#18. If \w\b\d are not made 
Unicode-aware by /u, we won't easily be able to fix them in the future.
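
To make the point about \w and \b concrete, here is a rough sketch. It relies on `\p{...}` and lookbehind as they were later standardized, and the word-character class `W` is a simplification chosen for brevity, not the full UTS #18 definition.

```javascript
const word = "\u00E9t\u00E9"; // "été", using precomposed U+00E9

// ASCII-only \w and \b treat "é" as a non-word character.
const asciiWordChar = /\w/.test("\u00E9");
const asciiBoundaries = word.replace(/\b/g, "|");

// A Unicode-aware word character built from property escapes
// (letters, marks, numbers, underscore; an approximation).
const W = "[\\p{L}\\p{M}\\p{N}_]";
const unicodeWordChar = new RegExp(W, "u").test("\u00E9");

// Emulating a Unicode-aware \b needs lookbehind as well as lookahead.
const boundary = new RegExp(
  "(?<!" + W + ")(?=" + W + ")|(?<=" + W + ")(?!" + W + ")", "gu");
const unicodeBoundaries = word.replace(boundary, "|");
```

The ASCII version places boundaries around the "t" only, while the emulated version places them at the start and end of the word.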


We went down this road before, and at the end you agreed that \w\b\d with /u 
should be Unicode aware. :/


I agree with adding \p{..} as soon as possible, with two caveats:

* If I recall correctly, mobile browser implementers voiced concerns about 
overhead during the es4-discuss days.

* It can easily be pushed down the road to ES7+.

Delaying /u, on the other hand, might mean also having to delay Norbert's 
work on code point matching, etc. Introducing \p{..} without code point 
matching would be nonideal. \p{..} might *need* to be delayed anyway to 
allow RegExp proposals already approved by TC39 (match web reality, 
lookbehind, flag /y), the flag /x strawman, and flag /u to be completed in 
time. For starters, it's not clear which properties \p{..} in ES would 
support, and there would be a number of other details to discuss, too.


Erik Corry wrote:

Make unpaired surrogates in /u regexps a syntax error.


Sounds good to me.

-- Steven Levithan


Re: Full Unicode based on UTF-16 proposal

This begs the question of what is the point of C1.

On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ m...@macchiato.com wrote:

 That would not be practical, nor predictable. And note that the 700K
 reserved code points are also not to be interpreted as characters; by your
 logic all of them would need to be converted to FFFD.

 And in practice, an unpaired surrogate is best treated just like a
 reserved (unassigned) code point. For example, a lowercase operation should
 convert characters with lowercase correspondents to those correspondents,
 and leave *everything* else alone: control characters, format characters,
 reserved code points, surrogates, etc.

 --
 Mark https://plus.google.com/114199149796022210033
 — Il meglio è l’inimico del bene —



 On Tue, Mar 27, 2012 at 08:02, Glenn Adams gl...@skynav.com wrote:



 On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ m...@macchiato.com wrote:

 That, as Norbert explained, is not the intention of the standard. Take a
 look at the discussion of Unicode 16-bit string in chapter 3. The
 committee recognized that fragments may be formed when working with UTF-16,
 and that destructive changes may do more harm than good.

 x = a.substring(0, 5) + b + a.substring(5, a.length());
 y = x.substring(0, 5) + x.substring(6, x.length());

 After this operation is done, you want y == a, even if 5 is between D800
 and DC00.
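
 A small sketch of that round trip in JavaScript. The index `k` plays the role of 5 in the example above and is chosen so that it falls between the two halves of a surrogate pair; the sample strings are illustrative.

```javascript
const a = "ab\u{1D306}cd"; // U+1D306 is stored as the pair D834 DF06
const b = "X";             // b.length === 1
const k = 3;               // index k falls between D834 and DF06

// Inserting b splits the pair, leaving two unpaired surrogates in x.
const x = a.substring(0, k) + b + a.substring(k);

// Removing b again re-joins the halves, provided nothing altered them.
const y = x.substring(0, k) + x.substring(k + 1);
```

 Here `y === a` holds because `substring` and `+` copy code units unchanged; replacing the temporary unpaired surrogates in `x` with U+FFFD would have destroyed the character.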


 Assuming that b.length() == 1 in this example, my interpretation of this
 is that '=', '+', and 'substring' are operations whose domain and co-domain
 are (currently defined) ES Strings, namely sequences of UTF-16 code units.
 Since none of these operations entail interpreting the semantics of a code
 point (i.e., interpreting abstract characters), then there is no violation
 of C1 here.

 Or take:
 output = "";
 for (int i = 0; i < s.length(); ++i) {
   ch = s.charAt(i);
   if (ch.equals('&')) {
     ch = '@';
   }
   output += ch;
 }

 After this operation is done, you want "a&\u{10000}b" to become
 "a@\u{10000}b", not "a@\u{FFFD}\u{FFFD}b".
 It is also an unnecessary burden on lower-level software to always check
 this stuff.
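
 The same loop as a runnable JavaScript sketch, assuming the character being replaced is "&" (U+0026), as the code unit sequence 0061 0026 D800 DC00 0062 quoted in the reply suggests. The `sanitized` variant is hypothetical and only illustrates the unwanted outcome.

```javascript
const s = "a&\u{10000}b"; // code units: 0061 0026 D800 DC00 0062

let output = "";
for (let i = 0; i < s.length; ++i) {
  let ch = s.charAt(i); // one code unit; may be half of a surrogate pair
  if (ch === "&") {
    ch = "@";
  }
  output += ch;         // the halves are re-joined by concatenation
}

// A hypothetical charAt that sanitized every surrogate half would instead
// turn the supplementary character into two replacement characters.
const sanitized = Array.from({ length: s.length }, (_, i) => {
  const cu = s.charCodeAt(i);
  return cu >= 0xD800 && cu <= 0xDFFF ? "\uFFFD" : s.charAt(i);
}).join("").replace("&", "@");
```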


 Again, in this example, I assume that the string literal "a&\u{10000}b"
 maps to the UTF-16 code unit sequence:

 0061 0026 D800 DC00 0062

 Given that 'charAt(i)' is defined on (and is indexing) code units and not
 code points, and since the 'equals' operator is also defined on code units,
 this example also does not require interpreting the semantics of code
 points (i.e., interpreting abstract characters).

 However, in Norbert's questions above about isUpperCase(int) and
 toUpperCase(int), it is clear that the domain of these operations is code
 points, not code units, and further, that they require interpretation as
 abstract characters in order to determine the semantics of the
 corresponding characters.

 My conclusion is that the determination of whether C1 is violated or not
 depends upon the domain, codomain, and operation being considered.


 Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or
 output, then you do need to either convert to FFFD or take some other
 action.
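
 Both outcomes can be seen in today's web platform APIs, which postdate this thread; the sketch below is an illustration of the two choices, not something the proposal defines.

```javascript
const illFormed = "a\uD800b"; // contains an unpaired high surrogate

// "Convert to FFFD": TextEncoder emits EF BF BD for the unpaired surrogate.
const bytes = Array.from(new TextEncoder().encode(illFormed));

// "Some other action": encodeURIComponent rejects the input with a URIError.
let threwUriError = false;
try {
  encodeURIComponent(illFormed);
} catch (e) {
  threwUriError = e instanceof URIError;
}
```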

 --
 Mark https://plus.google.com/114199149796022210033
 — Il meglio è l’inimico del bene —




Re: Full Unicode based on UTF-16 proposal

2012-03-27 Thread Mark Davis ☕

The point of C1 is that you can't interpret the surrogate code point U+DC00
as a *character*, like an "a".

Neither can you interpret the reserved code point U+0378 as a *character*,
like a "b".

--
Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —




Re: Full Unicode based on UTF-16 proposal

2012-03-27 Thread Gavin Barraclough

On Mar 26, 2012, at 11:57 PM, Erik Corry wrote:
 Add /U to mean old-style regexp literals in Harmony code (analogous to
 /s and /S which have opposite meanings).

Are we sure this has enough utility to be worth adding? It seems unlikely 
that programmers will often have cause to explicitly opt out of correct 
Unicode support (since little consideration usually seems to be given to this 
topic), and as discussed previously, a mechanism to do so already exists if 
they need it (RegExp("foo") will behave the same as the proposed /foo/U). If 
we do add a 'U' flag, I'd worry that it may more commonly end up being used in 
error when people intended to append a 'u'!

cheers,
G.

  

Re: Full Unicode based on UTF-16 proposal

So, if as a result of a policy of converting any UTF-16 code unit sequence
to a code point sequence one ends up with an unpaired surrogate, e.g.,
\u{00DC00}, then performing a predicate on that code point, such as
described in D21 (e.g., IsAlphabetic) would entail interpreting it as an
abstract character?

I can see that D20 defines code point properties which would not entail
interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter,
but where does one draw the line?


Re: Full Unicode based on UTF-16 proposal

ok, i'll accept your position at this point and drop my comment; i suppose
it is true that if there are already unpaired surrogates in user data as
UTF-16, then having unpaired surrogates as code points is no worse;

however, it would be useful if there were an informative pointer from the
spec under consideration to a UTC sanctioned list of operations that
constitute interpreting as abstract characters and, that, if used on such
data would possibly violate C1; to this end, it would be useful if C1
itself included a concrete example of such an operation

On Tue, Mar 27, 2012 at 2:02 PM, Mark Davis ☕ m...@macchiato.com wrote:

 performing a predicate on that code point, such as described in D21
 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?

 No.

 but where does one draw the line?

 The line is already drawn by the Unicode Consortium, by consulting the
 Unicode Character Database properties. If you look at the data in the Unicode
 Character Database for any particular property, say Alphabetic, you'll find
 that surrogate code points are not included where the property is a true
 character property. There are a few special cases where reserved code
 points are provisionally given anticipatory character properties, such as
 in bidi ranges, simply because that makes implementations more forward
 compatible, but there aren't any cases where a character property applies
 to a surrogate code point (other than by returning No, or n/a, or some
 such).
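
 One way to check this against the property data shipped with a JavaScript engine is sketched below. It assumes an engine with `\p{...}` support (standardized after this thread), and `hasProperty` is a hypothetical helper written for this illustration.

```javascript
// Returns true when the whole string is a single code point that has the
// given Unicode property, according to the engine's property data.
function hasProperty(property, str) {
  return new RegExp("^\\p{" + property + "}$", "u").test(str);
}

const results = {
  alphabeticLetter: hasProperty("Alphabetic", "a"),
  alphabeticSurrogate: hasProperty("Alphabetic", "\uDC00"),
  uppercaseSurrogate: hasProperty("Uppercase", "\uDC00"),
  // A code point property (the general category) still applies.
  surrogateCategory: hasProperty("General_Category=Surrogate", "\uDC00"),
};
```

 The character properties report false for U+DC00, while the general category query, which classifies code points and not characters, reports true.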

 --
 Mark https://plus.google.com/114199149796022210033
 — Il meglio è l’inimico del bene —




Re: Full Unicode based on UTF-16 proposal

Perfectly valid concerns.

My thinking here is that normally applications want to deal with code points, 
but we force them to deal with UTF-16 and additional flags because we need them 
for compatibility. Within modules, where we know that compatibility is not an 
issue, I'd rather give applications by default what they need.

Looking back at Java, supporting supplementary characters was fairly painless 
for many applications despite UTF-16 because Java already had a rich API 
performing all kinds of operations on strings, so many applications had little 
need to look at individual characters in the first place. We went through the 
entire Java SE API and fixed all those operations to use code point semantics 
(look for "under the hood" in [1] for details). We were also able to switch 
regular expressions to code point semantics without any flags because regular 
expressions never worked on binary data and developers hadn't created funky 
workarounds to support supplementary characters yet. JavaScript today has more 
constraints, but for new development it would still be good to get as close as 
possible to that experience.

Norbert

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/


On Mar 24, 2012, at 23:56 , David Herman wrote:

 On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:
 
 One concern: I think code point based matching should be the default for 
 regex literals within modules (where we know the code is written for 
 Harmony).
 
 This idea makes me nervous. Partly because I think we should keep the set of 
 semantic changes between non-module code and module code reasonably small, 
 and partly because the idea of your proposal is to continue to treat strings 
 as sequences of 16-bit code units, not Unicode code points, which means that 
 quietly switching regexps to be closer to operating at the level of code 
 points seems like it creates a kind of impedance mismatch. It feels more 
 appropriate to me to require programmers to declare explicitly that they're 
 dealing with a string at the level of code points, using the (quite concise) 
 /u flag. That way they're saying "yes, I know this string is just a sequence 
 of 16-bit code units, but it may contain non-BMP data, and I would like to 
 match its contents with a regexp that deals with code points."
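
 The mismatch described here can be sketched as follows, using the /u flag as it was eventually standardized; the variable names are illustrative only.

```javascript
const s = "\u{1D306}"; // one supplementary code point, two code units

const codeUnitLength = s.length;              // the string stays code-unit based
const codePointCount = Array.from(s).length;  // iteration is by code point

// Without /u, "." matches a single code unit, so the pair does not fit.
const matchesWithoutU = /^.$/.test(s);

// With /u, the programmer opts in to code point matching.
const matchesWithU = /^.$/u.test(s);
```

 The string reports length 2 but contains one code point; only the /u pattern treats it as a single character.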
 
 (Again, I'm still new to the finer points of Unicode, so I'm prepared to be 
 shown I'm thinking about it wrong.)
 
 Dave
 


Re: Full Unicode based on UTF-16 proposal

Let's see:

- Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to 
convert it, so isValid doesn't really help. You still have to look at all code 
units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.

- Conversion from UTF-8: For security reasons, you have to check for 
well-formedness before conversion, in particular to catch non-shortest forms 
[1].

- HTML form data: Same situation as conversion to UTF-8.

- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.
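
The second bullet (checking well-formedness when converting from UTF-8) can be illustrated with the `TextDecoder` API, which postdates this thread. This is a sketch of the general technique, using the classic overlong encoding of "/" as the test input.

```javascript
// 0xC0 0xAF is a non-shortest-form (overlong) encoding of "/" (U+002F).
const overlong = new Uint8Array([0xC0, 0xAF]);

// Strict decoding rejects the ill-formed input outright.
let rejected = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(overlong);
} catch (e) {
  rejected = e instanceof TypeError;
}

// Lenient decoding substitutes U+FFFD; it never yields "/".
const lenient = new TextDecoder("utf-8").decode(overlong);
```

Either way the decoder never produces the "/" that a naive decoder would, which is the exploit the reference describes.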

I don't think we'd add API just to flag an issue - that's what documentation is 
for.

Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit



On Mar 25, 2012, at 1:57 , Roger Andrews wrote:

 I use something like String.isValid functionality in a transcoder that
 converts Strings to/from UTF-8, HTML Formdata (MIME type
 application/x-www-form-urlencoded -- not the same as URI encoding!), and
 Base64.
 
 Admittedly these currently use 'encodeURI' to do the work, or it just drops
 out naturally when considering UTF-8 sequences.
 
 (I considered testing the regexp
 /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
 against the input string.)
 
 Maybe the function is too obscure for general use, although its presence does 
 flag up the surrogate-pair issue to developers.
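
 A sketch of the check being discussed, built on the regular expression mentioned above with its range endpoints assumed to be \u0000 and \uFFFF. The name `isValidUtf16` is a hypothetical stand-in for the suggested String.isValid; nothing here is a proposed API.

```javascript
// Every code unit must be either a non-surrogate, or a high surrogate
// immediately followed by a low surrogate. No /u flag: the pattern is
// deliberately written in terms of code units.
const WELL_FORMED =
  /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

// Hypothetical helper: true when surrogates are used correctly in str.
function isValidUtf16(str) {
  return WELL_FORMED.test(str);
}

const checks = [
  isValidUtf16(""),              // empty string is well-formed
  isValidUtf16("abc\n"),         // BMP only, including a line terminator
  isValidUtf16("\uD834\uDF06"),  // proper surrogate pair
  isValidUtf16("\uD834"),        // unpaired high surrogate
  isValidUtf16("\uDF06\uD834"),  // halves in the wrong order
];
```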
 
 --
 From: Norbert Lindenberg ecmascr...@norbertlindenberg.com
 
 It's easy to provide this function, but in which situations would it be
 useful? In most cases that I can think of you're interested in far more
 constrained definitions of validity:
 - what are valid ECMAScript identifiers?
 - what are valid BCP 47 language tags?
 - what are the characters allowed in a certain protocol?
 - what are the characters that my browser can render?
 
 Thanks,
 Norbert
 
 
 On Mar 24, 2012, at 12:12 , David Herman wrote:
 
 On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
 
 Concerning UTF-16 surrogate pairs, how about a function like:
String.isValid( str )
 to discover whether surrogates are used correctly in 'str'?
 
 Something like Array.isArray().
 
 No need for it to be a class method, since it only operates on strings.
 We could simply have String.prototype.isValid(). Note that it would work
 for primitive strings as well, thanks to JS's automatic promotion
 semantics.
 
 Dave
 


Re: Full Unicode based on UTF-16 proposal

There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just use 
the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can 
be useful in cases such as regular expressions where you might have to refer to 
range limits that aren't actually assigned characters, or in test cases where 
you might use characters for which your OS doesn't have glyphs yet.
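
As a sketch of what such escapes look like in practice, the following uses the `\u{...}` form that was later standardized, which may differ in detail from the strawman linked above.

```javascript
// Three spellings of the same string: code point escape, explicit surrogate
// pair, and construction from a numeric code point.
const viaEscape = "\u{1D306}";
const viaPair = "\uD834\uDF06";
const viaApi = String.fromCodePoint(0x1D306);

// Escapes are handy for range limits that are not assigned characters,
// such as the bounds of the supplementary planes.
const supplementary = /^[\u{10000}-\u{10FFFF}]$/u;

const escapeIsSupplementary = supplementary.test(viaEscape);
const letterIsSupplementary = supplementary.test("a");
```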

Norbert


On Mar 25, 2012, at 2:57 , Roger Andrews wrote:

 Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character 
 literals? The \U format expresses a full 32-bit code, which could be mapped 
 internally to two 16-bit UTF-16 codes.
 
 Then the programmer can describe exactly the required characters without 
 caring about their coding in UTF-16 or whatever.
 
 Could you use this to avoid complicated things in RegExps like 
 [{\u\u}-{\u\u}], instead have things like 
 [\U0001-\U0003] -- naturally expressing the characters of interest?
 
 The same goes for String literals, where the programmer does not really care 
 about the encoding, just specifying the character.
 
 (Sorry if I've missed something in the prior discussion.)
 
 --
 From: Norbert Lindenberg
 To: David Herman
 
 On Mar 24, 2012, at 12:21 , David Herman wrote:
 
 [snip]
 
 As for whether the switch to code-point-based matching should be universal 
 or require /u (an issue that your proposal leaves open), IMHO it's better 
 to require /u since it avoids the need for transforming 
 \u[\u-\u] to [{\u\u}-{\u\u}] and 
 [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and 
 additionally avoids at least three potentially breaking changes (two of 
 which are explicitly mentioned in your proposal):
 
 I haven't completely understood this part of the discussion. Looking at /u 
 as a little red switch (LRS), i.e., an opportunity to make judicious 
 breaks with compatibility, could we not allow character classes with 
 unescaped non-BMP code points, e.g.:
 
   js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
   ["𝌆𝌇𝌈𝌉𝌊"]
 
 I'm still getting up to speed on Unicode and JS string semantics, so I'm 
 guessing that I'm missing a reason why that wouldn't work... Presumably the 
 JS source of the regexp literal, as a sequence of UTF-16 code units, 
 represents the tetragram code points as surrogate pairs. Can we not 
 recognize surrogate pairs in character classes within a /u regexp and 
 interpret them as code points?
 
 With /u, that's exactly what happens. My first proposal was to make this 
 happen even without a new flag, i.e., make
 "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
 work based on code points, and Steve is arguing against that because of 
 compatibility risk. My proposal also includes some transformations to keep 
 existing regular expressions working, and Steve correctly observes that if 
 we have a flag for code point mode, then the transformation is not needed - 
 old regular expressions would continue to work in code unit mode, while new 
 regular expressions with /u get code point treatment.


Re: Full Unicode based on UTF-16 proposal

2012-03-26 Thread Roger Andrews


The strawman is for source code characters, and says it has no implications
for string value encodings (or RegExps).
String & regexp literal escape sequences are explicitly defined in ES5 
sections 7.8.4 & 7.8.5.
Will Strawman style also work in ES6 string & regexp literals?  Thus making 
regexp ranges much nicer (see final example below).



As well as describing code points that have not yet been defined as
characters, character escapes in string literals and regexps are good:
1)  control characters don't have glyphs at all,
2)  the various space glyphs are not readily distinguishable (same for some 
dash/minus/line glyphs),

3)  breaking/non-breaking versions of characters are not distinguishable,
4)  many other glyphs are hard to distinguish (being tiny adjustments in
positioning or form detail),
5)  some characters are combining -- which makes for a messy and confusing
program if you use them raw.

If you use the raw non-ASCII characters in a program then you need some 
means of creating them, preferably via a normal keyboard and in your 
favourite text editor.
All program readers need appropriate fonts installed to fully understand 
the program, and program maintainers also need a Unicode-capable text editor 
(potentially including non-BMP support).

All links/stores that the program travels over or rests in must be
Unicode-capable.
Whereas using only ASCII chars to write a program is easy to do and always
works no matter how basic your computing/transmission infrastructure. 
(ASCII chars never get silently mangled in transmission or text editors.)


How to represent character escapes in a language.
C/C++ has:
   \xNN         8-bit char (U+0000 - U+00FF)
   \uNNNN       16-bit char (U+0000 - U+FFFF)
   \UNNNNNNNN   32-bit char (i.e. any 21-bit Unicode char)
Strawman for source chars has:
   \u{N...}   8 to 24-bit char (i.e. any 21-bit Unicode char)


I'm struggling with how non-BMP escapes would be used in practice in strings
& regexps -- especially regexp ranges.  Will Strawman style be used in 
string & regexp literals?


Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").

To create the string "I like 𝌇" using escapes
in C/C++ you can create a string:
  "I like \U0001D307"
if the Strawman style works in strings, in ES6 presumably you say:
  "I like \u{1D307}"
or do you have to know UTF-16 encoding rules and say:
  "I like \uD834\uDF07"
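
For reference, a minimal sketch of how the three spellings relate. It assumes an ES2015 or later engine, where the brace form of the escape was eventually standardized for string literals; the helper name `toSurrogatePair` is hypothetical and not part of any proposal in this thread.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D307); // 0xD834, 0xDF07

// The code point escape and the explicit surrogate pair produce the
// same two-code-unit string.
const sameString = "I like \u{1D307}" === "I like \uD834\uDF07";
const fromUnits = String.fromCharCode(high, low) === "\u{1D307}";
```

The programmer writing `\u{1D307}` never has to do the arithmetic in `toSurrogatePair`; it is shown only to make the UTF-16 form less mysterious.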

To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/,
should the programmer write C/C++ style:
   /[\U0001D307-\U0001D356]/
or will Strawman style work in regexps:
   /[\u{1D307}-\u{1D356}]/
or UTF-16 with {} grouping:
   /[{\uD834\uDF07}-{\uD834\uDF56}]/
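
A short sketch of the Strawman-style range, assuming an ES2015 or later engine in which the `u` flag enables both code point matching and the brace escape inside regular expressions. The variable names are illustrative only.

```javascript
// Code point range U+1D307..U+1D356, written with code point escapes.
// The `u` flag is required for \u{...} to be read as an escape.
const tetragrams = /^[\u{1D307}-\u{1D356}]+$/u;

const inRange = tetragrams.test("\u{1D307}\u{1D310}\u{1D356}");
const below = tetragrams.test("\u{1D306}");
// A lone half of a surrogate pair is not in the range either.
const loneHalf = tetragrams.test("\uD834");
```

Here `inRange` is true, while `below` and `loneHalf` are false, so the range reads the same way as it would in a code-point-oriented language.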

Either C/C++ style or Strawman style escape is readable, natural, doesn't
require knowledge of UTF-16 encoding rules, can be created easily with any 
old keyboard, and won't upset text editors.


It's a bit unfriendly to require programmers to know UTF-16 rules just to
put a non-BMP character in a string or regexp using an escape.  And in a
regexp range it looks ugly and confusing.


--
From: Norbert Lindenberg


There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just
use the characters directly, as Dave did in
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as
regular expressions where you might have to refer to range limits that
aren't actually assigned characters, or in test cases where you might use
characters for which your OS doesn't have glyphs yet.

Norbert


On Mar 25, 2012, at 2:57 , Roger Andrews wrote:


Doesn't C/C++ allow non-BMP code points using \U in character
literals.  The \U format expresses a full 32-bit code, which could be
mapped internally to two 16-bit UTF-16 codes.

Then the programmer can describe exactly the required characters without
caring about their coding in UTF-16 or whatever.

Could you use this to avoid complicated things in RegExps like
[{\u\u}-{\u\u}], instead have things like
[\U0001-\U0003] -- naturally expressing the characters of
interest?

The same goes for String literals, where the programmer does not really
care about the encoding, just specifying the character.

(Sorry if I've missed something in the prior discussion.)

--
From: Norbert Lindenberg
To: David Herman


On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]


As for whether the switch to code-point-based matching should be
universal or require /u (an issue that your proposal leaves open),
IMHO it's better to require /u since it avoids the need for
transforming \u[\u-\u] to [{\u\u}-{\u\u}]
and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}],
and additionally

Re: Full Unicode based on UTF-16 proposal

2012-03-26 Thread Roger Andrews

Maybe String.isValid is just not generally useful enough.  I accept the 
point that you don't add APIs simply to flag an issue, (there has to be more 
weighty justification to carry the trifle).



PS:
As for UTF-16 -> UTF-8 or HTML-Formdata, I decided to follow encodeURI / 
encodeURIComponent's lead and throw an exception.  Maybe that's the wrong 
thing to do?


My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the 
right thing to do.  Thanks for the link which explains why.


Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same 
issues as above really.


--
From: Norbert Lindenberg


Let's see:

- Conversion to UTF-8: If the string isn't well-formed, you wouldn't 
refuse to convert it, so isValid doesn't really help. You still have to 
look at all code units, and convert unpaired surrogates to the UTF-8 
sequence for U+FFFD.


- Conversion from UTF-8: For security reasons, you have to check for 
well-formedness before conversion, in particular to catch non-shortest 
forms [1].


- HTML form data: Same situation as conversion to UTF-8.

- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.

I don't think we'd add API just to flag an issue - that's what 
documentation is for.
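
The two policies mentioned for conversion to UTF-8 (throw, or substitute U+FFFD) can both be observed in current engines. This is a sketch only, assuming a runtime that provides `TextEncoder` as a global (browsers and recent Node.js releases do); it is not the transcoder discussed in this thread.

```javascript
const lone = "a\uD800b"; // contains an unpaired high surrogate

// encodeURIComponent refuses ill-formed input by throwing a URIError.
let threw = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}

// TextEncoder (assumed available as a global) substitutes U+FFFD,
// whose UTF-8 form is the three bytes EF BF BD.
const bytes = Array.from(new TextEncoder().encode(lone));
// bytes: [0x61, 0xEF, 0xBF, 0xBD, 0x62]
```

Either way the converter has to inspect every code unit, which is the point made above: a separate validity check does not save any work.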


Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit



On Mar 25, 2012, at 1:57 , Roger Andrews wrote:


I use something like String.isValid functionality in a transcoder that
converts Strings to/from UTF-8, HTML Formdata (MIME type
application/x-www-form-urlencoded -- not the same as URI encoding!), and
Base64.

Admittedly these currently use 'encodeURI' to do the work, or it just drops
out naturally when considering UTF-8 sequences.

(I considered testing the regexp
/^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
against the input string.)

Maybe the function is too obscure for general use, although its presence 
does flag up the surrogate-pair issue to developers.


--
From: Norbert Lindenberg ecmascr...@norbertlindenberg.com


It's easy to provide this function, but in which situations would it be
useful? In most cases that I can think of you're interested in far more
constrained definitions of validity:
- what are valid ECMAScript identifiers?
- what are valid BCP 47 language tags?
- what are the characters allowed in a certain protocol?
- what are the characters that my browser can render?

Thanks,
Norbert


On Mar 24, 2012, at 12:12 , David Herman wrote:


On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:


Concerning UTF-16 surrogate pairs, how about a function like:
   String.isValid( str )
to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().


No need for it to be a class method, since it only operates on strings.
We could simply have String.prototype.isValid(). Note that it would work
for primitive strings as well, thanks to JS's automatic promotion
semantics.

Dave





Re: Full Unicode based on UTF-16 proposal

2012-03-26 Thread Roger Andrews


Steven Levithan wrote:

[snip]
* /\u{10}/ eq /u{10}/ (literal u repeated 10 times).


A point in favour of \U over \u{x...} as a representation of 
character escapes? -- to avoid ambiguity in regexps.
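
The ambiguity can be seen side by side in an ES2015 or later engine, where the meaning of the brace form ended up depending on the `u` flag. A minimal sketch, with illustrative variable names:

```javascript
// Without the `u` flag, \u not followed by four hex digits is just a
// literal "u", so {2} is read as a quantifier.
const legacy = /^\u{2}$/;
const legacyMatchesUU = legacy.test("uu");
const legacyMatchesCtrl = legacy.test("\u0002");

// With the `u` flag, \u{2} is an escape for the code point U+0002.
const unicode = /^\u{2}$/u;
const unicodeMatchesUU = unicode.test("uu");
const unicodeMatchesCtrl = unicode.test("\u0002");
```

The legacy pattern matches "uu" and not U+0002; the `u`-flagged pattern does the reverse.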




Re: Full Unicode based on UTF-16 proposal

OK, I guess we have to have Unicode code point escapes :-)

I'd expect them to work in identifiers, string literals, and regular 
expressions (possibly with restrictions coming out of today's emails), but not 
in JSON source.

Norbert


On Mar 26, 2012, at 4:45 , Roger Andrews wrote:

 The strawman is for source code characters, and says it has no implications
 for string value encodings (or RegExps).
 String & regexp literal escape sequences are explicitly defined in ES5 
 sections 7.8.4 & 7.8.5.
 Will Strawman style also work in ES6 string & regexp literals?  Thus making 
 regexp ranges much nicer (see final example below).
 
 
 As well as describing code points that have not yet been defined as
 characters, character escapes in string literals and regexps are good:
 1)  control characters don't have glyphs at all,
 2)  the various space glyphs are not readily distinguishable (same for some 
 dash/minus/line glyphs),
 3)  breaking/non-breaking versions of characters are not distinguishable,
 4)  many other glyphs are hard to distinguish (being tiny adjustments in
 positioning or form detail),
 5)  some characters are combining -- which makes for a messy and confusing
 program if you use them raw.
 
 If you use the raw non-ASCII characters in a program then you need some means 
 of creating them, preferably via a normal keyboard and in your favourite text 
 editor.
 All program readers need appropriate fonts installed to fully understand the 
 program, and program maintainers also need a Unicode-capable text editor 
 (potentially including non-BMP support).
 All links/stores that the program travels over or rests in must be
 Unicode-capable.
 Whereas using only ASCII chars to write a program is easy to do and always
 works no matter how basic your computing/transmission infrastructure. (ASCII 
 chars never get silently mangled in transmission or text editors.)
 
 How to represent character escapes in a language.
 C/C++ has:
   \xNN         8-bit char (U+0000 - U+00FF)
   \uNNNN       16-bit char (U+0000 - U+FFFF)
   \UNNNNNNNN   32-bit char (i.e. any 21-bit Unicode char)
 Strawman for source chars has:
   \u{N...}   8 to 24-bit char (i.e. any 21-bit Unicode char)
 
 
 I'm struggling with how non-BMP escapes would be used in practice in strings
 & regexps -- especially regexp ranges.  Will Strawman style be used in string 
 & regexp literals?
 
 Considering U+1D307 (𝌇) as an example (where "𝌇" == "\uD834\uDF07").
 
 To create the string "I like 𝌇" using escapes
 in C/C++ you can create a string:
  "I like \U0001D307"
 if the Strawman style works in strings, in ES6 presumably you say:
  "I like \u{1D307}"
 or do you have to know UTF-16 encoding rules and say:
  "I like \uD834\uDF07"
 
 To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/,
 should the programmer write C/C++ style:
   /[\U0001D307-\U0001D356]/
 or will Strawman style work in regexps:
   /[\u{1D307}-\u{1D356}]/
 or UTF-16 with {} grouping:
   /[{\uD834\uDF07}-{\uD834\uDF56}]/
 
 Either C/C++ style or Strawman style escape is readable, natural, doesn't
 require knowledge of UTF-16 encoding rules, can be created easily with any 
 old keyboard, and won't upset text editors.
 
 It's a bit unfriendly to require programmers to know UTF-16 rules just to
 put a non-BMP character in a string or regexp using an escape.  And in a
 regexp range it looks ugly and confusing.
 
 
 --
 From: Norbert Lindenberg
 
 There is a strawman for code point escapes:
 http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences
 
 Note that for references to specific characters it's usually best to just
 use the characters directly, as Dave did in
 "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as
 regular expressions where you might have to refer to range limits that
 aren't actually assigned characters, or in test cases where you might use
 characters for which your OS doesn't have glyphs yet.
 
 Norbert
 
 
 On Mar 25, 2012, at 2:57 , Roger Andrews wrote:
 
 Doesn't C/C++ allow non-BMP code points using \U in character
 literals.  The \U format expresses a full 32-bit code, which could be
 mapped internally to two 16-bit UTF-16 codes.
 
 Then the programmer can describe exactly the required characters without
 caring about their coding in UTF-16 or whatever.
 
 Could you use this to avoid complicated things in RegExps like
 [{\u\u}-{\u\u}], instead have things like
 [\U0001-\U0003] -- naturally expressing the characters of
 interest?
 
 The same goes for String literals, where the programmer does not really
 care about the encoding, just specifying the character.
 
 (Sorry if I've missed something in the prior discussion.)
 
 --
 From: Norbert Lindenberg
 To: David Herman
 
 On Mar 24, 2012, at 12:21 ,

Re: Full Unicode based on UTF-16 proposal


On Mar 26, 2012, at 13:02 , Gavin Barraclough wrote:

 Hi Norbert,
 
 I really like the direction you're going in, but have one minor concern 
 relating to regular expressions.
 
 In your proposal, you currently state:
   A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
 a surrogate pair, is interpreted as a code point with the same value.
 
 I think this makes sense in the context of your original proposal, which 
 seeks to be backwards compatible with existing regular expressions through 
 the range transformations.  But I'm concerned that this might prove 
 problematic, and would suggest that if we're going to make unicode regexp 
 match opt-in through a /u flag then instead it may be better to make unpaired 
 surrogates in unicode expressions a syntax error.

That's worth considering. It seems we're more and more moving towards two 
separate RegExp versions anyway - a legacy version based on code units and with 
all kinds of quirks, and an all-around-better version based on code points. It 
means however that you can't easily remove unpaired surrogates by
   str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}")
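
For comparison, this is how that replacement behaves when lone surrogates are accepted in a code point pattern, which is the interpretation in the proposal and also what ES2015 and later engines do with the `u` flag. A sketch with illustrative names:

```javascript
// In `u` mode a well-formed pair is one code point (here U+1D307),
// so the class only matches surrogates that stand alone.
const loneSurrogates = /[\u{D800}-\u{DFFF}]/ug;

const input = "a\uD800b\uD834\uDF07c\uDC00";
const cleaned = input.replace(loneSurrogates, "\u{FFFD}");
// cleaned: "a\uFFFDb\uD834\uDF07c\uFFFD"
```

If unpaired surrogates in a `u` pattern were a syntax error instead, this one-liner would have to be written in code unit mode with explicit handling of pairs.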

 My concern would be expressions such as:
   /[\uD800\uDC00\uDC00\uD800]/u
 Under my reading of the current proposal, this could match any of 
 \uD800\uDC00, \uD800, or \uDC00.  Allowing this seems to introduce the 
 concept of precedence to character classes (given an input \uD800\uDC00, 
 should I choose to match \uD800\uDC00 or \uD800?).  It may also 
 significantly complicate the implementation of backtracking if we were to 
 allow this (if I have matched \uD800\uDC00, should I step back by one code 
 unit or two?).

I think/hope that my specification is clear: a surrogate pair is always treated 
as one entity, not as two pieces. If the input is \uD800\uDC00, you match 
\uD800\uDC00. If you have to backtrack over \uD800\uDC00, you step back two 
code units.
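
The behavior described here can be checked in an ES2015 or later engine, whose `u` flag treats a surrogate pair as one entity in both pattern and input. A minimal sketch using the character class from the question; the variable names are illustrative.

```javascript
// Class contains three code points in `u` mode:
// U+10000 (the pair \uD800\uDC00), U+DC00 and U+D800.
const cls = /[\uD800\uDC00\uDC00\uD800]/u;

// A paired input is consumed as a single two-code-unit match.
const pairMatch = "\uD800\uDC00".match(cls)[0];
const pairLength = pairMatch.length;

// Lone surrogates in the input still match on their own.
const loneHigh = cls.test("\uD800x");
const loneLow = cls.test("x\uDC00");

// A lone-surrogate pattern does not match half of a pair.
const halfOfPair = /\uD800/u.test("\uD800\uDC00");
```

`pairLength` is 2 and `halfOfPair` is false, so there is no choice between matching the pair and matching its first half.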

 It also just seems much clearer from a user perspective to say that 
 non-unicode regular expressions match code units, unicode regular expressions 
 match code points - mixing the two seems unhelpful.
 
 If opt-in is automatic in modules, programmers will likely want an escape to 
 be able to write non-unicode regular expressions, but I don't think this 
 should warrant an extra flag, I don't think we can automatically change the 
 behaviour of the RegExp constructor (without a u flag being passed), so 
 RegExp("\uD800") should still be available to support non-unicode matching 
 within modules.

Agreed, especially after reading Erik's and your additional emails on this.


Re: Full Unicode based on UTF-16 proposal

The conformance clause doesn't say anything about the interpretation of 
(UTF-16) code units as code points. To check conformance with C1, you have to 
look at how the resulting code points are actually further interpreted.

My proposal interprets the resulting code points in the following ways:

1) In regular expressions, they can be used in both patterns and input strings 
to be matched. They may be compared against other code points, or against 
character classes, some of which will hopefully soon be defined by Unicode 
properties. In the case of comparing against other code points, they can't 
match any code points assigned to abstract characters. In the case of Unicode 
properties, they'll typically fall into the large bucket of have-nots, along 
with other unassigned code points or, for example, U+FFFD, unless you ask for 
their general category.

2) When parsing identifiers, they will not have the ID_Start or ID_Continue 
properties, so they'll be excluded, just like other unassigned code points or 
U+FFFD.

3) In case conversion, they won't have upper case or lower case equivalents 
defined, and remain as is, as would happen for unassigned code points or U+FFFD.

I don't think any of these amounts to interpretation as abstract characters. 
I mention U+FFFD because the alternative interpretation of unpaired surrogates 
would be to replace them with U+FFFD, but that doesn't seem to improve anything.
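
A small sketch of what "a code point with the same value" looks like in practice, assuming an ES2015 or later engine (which provides `codePointAt` and code point iteration). It illustrates points such as case conversion above and is not taken from the proposal text.

```javascript
const lone = "\uD800"; // an unpaired high surrogate

// Read as a code point, it keeps its own value.
const value = lone.codePointAt(0); // 0xD800

// Case conversion leaves it untouched, as it does U+FFFD.
const upperUnchanged = lone.toUpperCase() === lone;
const lowerUnchanged = lone.toLowerCase() === lone;
const fffdUnchanged = "\uFFFD".toUpperCase() === "\uFFFD";

// Iterating by code point yields it as its own element.
const parts = Array.from("a" + lone + "b");
```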

Norbert



On Mar 26, 2012, at 15:10 , Glenn Adams wrote:

 On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough barraclo...@apple.com 
 wrote:
 I really like the direction you're going in, but have one minor concern 
 relating to regular expressions.
 
 In your proposal, you currently state:
A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
 a surrogate pair, is interpreted as a code point with the same value.
 
 Just as a reminder, this would be in explicit violation of the Unicode 
 conformance clause C1 unless it can be guaranteed that such a code point will 
 not be interpreted as an abstract character:
 
 C1  A process shall not interpret a high-surrogate code point or a 
 low-surrogate code point as an abstract character.
 
 [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf 
 
 Given that such guarantee is likely impractical, this presents a problem for 
 the above proposed language.


Re: Full Unicode based on UTF-16 proposal

2012-03-26 Thread Steven Levithan


Norbert Lindenberg wrote:

The ugly world of web reality...

Actually, in V8, Firefox, Safari, and IE, /[\u{1}]/ seems to be the
same as /[\\u01{}]/ - it matches \\u01{}u01. In Opera, it doesn't seem to
match anything, but doesn't throw the specified SyntaxError either.


How did you test this? I get consistent results that agree with Erik in IE 
9, Firefox 11, Chrome 17, and Safari 5.1:


"\\u01{}".match(/[\u{1}]/g); // ['u','0','1','{','}']
/\u{2}/g.test("uu"); // true

Opera, as you said, returns null and false (tested v11.6 and v10.0).


Do we know of any applications actually relying on these bugs, seeing that
browsers don't agree on them?


Minus Opera, browsers do agree on them. Admirably so. And they aren't 
bugs--they're intentional breaks from ES for backcompat with earlier 
implementations that were themselves designed for backcompat with older 
non-ES regex behavior. The RegExp Match Web Reality proposal at 
http://wiki.ecmascript.org/doku.php?id=harmony:regexp_match_web_reality 
says to add them to the spec, and Allen has said the web reality proposal 
should be the top RegExp priority for ES6.


I'd easily believe it's safe enough to change /[\u{n..}]/ because of the 
four-part sequence involved in \u + { + n.. + } that is fairly unlikely to 
appear in that specific order in a character class. But I'd have a harder 
time believing /\u{n..}/ is safe to change. It would of course be great to 
have some real data on the risks/damage.



For string literals, I see that most implementations correctly throw a
SyntaxError when given \u{10}. The exception here is V8.


I'm sure it would be safer to allow \u{n..} for string literals even if this 
fortunate SyntaxError wasn't thrown. Users haven't been trained to think of 
escaped nonmetacharacters as safe for string literals to the extent that 
they have for regexes, and you can't programmatically generate such escapes 
so easily as when passing to the RegExp constructor.


-- Steven Levithan


Re: Full Unicode based on UTF-16 proposal

2012-03-25 Thread Norbert Lindenberg

It's easy to provide this function, but in which situations would it be useful? 
In most cases that I can think of you're interested in far more constrained 
definitions of validity:
- what are valid ECMAScript identifiers?
- what are valid BCP 47 language tags?
- what are the characters allowed in a certain protocol?
- what are the characters that my browser can render?

Thanks,
Norbert


On Mar 24, 2012, at 12:12 , David Herman wrote:

 On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
 
 Concerning UTF-16 surrogate pairs, how about a function like:
 String.isValid( str )
 to discover whether surrogates are used correctly in 'str'?
 
 Something like Array.isArray().
 
 No need for it to be a class method, since it only operates on strings. We 
 could simply have String.prototype.isValid(). Note that it would work for 
 primitive strings as well, thanks to JS's automatic promotion semantics.
 
 Dave
 


Re: Full Unicode based on UTF-16 proposal

2012-03-25 Thread David Herman

On Mar 24, 2012, at 11:23 PM, Norbert Lindenberg wrote:

 On Mar 24, 2012, at 12:21 , David Herman wrote:
 
 I'm still getting up to speed on Unicode and JS string semantics, so I'm 
 guessing that I'm missing a reason why that wouldn't work... Presumably the 
 JS source of the regexp literal, as a sequence of UTF-16 code units, 
 represents the tetragram code points as surrogate pairs. Can we not 
 recognize surrogate pairs in character classes within a /u regexp and 
 interpret them as code points?
 
 With /u, that's exactly what happens. My first proposal was to make this 
 happen even without a new flag, i.e., make
 "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
 work based on code points, and Steve is arguing against that because of 
 compatibility risk. My proposal also includes some transformations to keep 
 existing regular expressions working, and Steve correctly observes that if we 
 have a flag for code point mode, then the transformation is not needed - old 
 regular expressions would continue to work in code unit mode, while new 
 regular expressions with /u get code point treatment.

Excellent!

Thanks,
Dave


Re: Full Unicode based on UTF-16 proposal

2012-03-25 Thread Roger Andrews


I use something like String.isValid functionality in a transcoder that
converts Strings to/from UTF-8, HTML Formdata (MIME type
application/x-www-form-urlencoded -- not the same as URI encoding!), and
Base64.

Admittedly these currently use 'encodeURI' to do the work, or it just drops
out naturally when considering UTF-8 sequences.

(I considered testing the regexp
/^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
against the input string.)

Maybe the function is too obscure for general use, although its presence 
does flag up the surrogate-pair issue to developers.
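
A sketch of the regexp-based test, assuming the two BMP ranges in the pattern are \u0000-\uD7FF and \uE000-\uFFFF. The constant name `WELL_FORMED_UTF16` is hypothetical. The pattern is deliberately used without the `u` flag, so it works on code units.

```javascript
// Zero or more units, each either a non-surrogate BMP code unit or a
// high surrogate immediately followed by a low surrogate.
const WELL_FORMED_UTF16 =
  /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

const okPair = WELL_FORMED_UTF16.test("a\uD834\uDF07b");
const loneHigh = WELL_FORMED_UTF16.test("a\uD834b");
const loneLow = WELL_FORMED_UTF16.test("\uDF07");
```

`okPair` is true; `loneHigh` and `loneLow` are false.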


--
From: Norbert Lindenberg ecmascr...@norbertlindenberg.com


It's easy to provide this function, but in which situations would it be
useful? In most cases that I can think of you're interested in far more
constrained definitions of validity:
- what are valid ECMAScript identifiers?
- what are valid BCP 47 language tags?
- what are the characters allowed in a certain protocol?
- what are the characters that my browser can render?

Thanks,
Norbert


On Mar 24, 2012, at 12:12 , David Herman wrote:


On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:


Concerning UTF-16 surrogate pairs, how about a function like:
String.isValid( str )
to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().


No need for it to be a class method, since it only operates on strings.
We could simply have String.prototype.isValid(). Note that it would work
for primitive strings as well, thanks to JS's automatic promotion
semantics.

Dave





Re: Full Unicode based on UTF-16 proposal

2012-03-25 Thread Roger Andrews

Doesn't C/C++ allow non-BMP code points using \U in character 
literals?  The \U format expresses a full 32-bit code, which could be mapped 
internally to two 16-bit UTF-16 codes.


Then the programmer can describe exactly the required characters without 
caring about their coding in UTF-16 or whatever.


Could you use this to avoid complicated things in RegExps like 
[{\u\u}-{\u\u}], instead have things like 
[\U0001-\U0003] -- naturally expressing the characters of interest?


The same goes for String literals, where the programmer does not really care 
about the encoding, just specifying the character.


(Sorry if I've missed something in the prior discussion.)

--
From: Norbert Lindenberg
To: David Herman


On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]

As for whether the switch to code-point-based matching should be 
universal or require /u (an issue that your proposal leaves open), IMHO 
it's better to require /u since it avoids the need for transforming 
\u[\u-\u] to [{\u\u}-{\u\u}] and 
[\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and 
additionally avoids at least three potentially breaking changes (two of 
which are explicitly mentioned in your proposal):


I haven't completely understood this part of the discussion. Looking at 
/u as a little red switch (LRS), i.e., an opportunity to make judicious 
breaks with compatibility, could we not allow character classes with 
unescaped non-BMP code points, e.g.:


   js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
   ["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm 
guessing that I'm missing a reason why that wouldn't work... Presumably 
the JS source of the regexp literal, as a sequence of UTF-16 code units, 
represents the tetragram code points as surrogate pairs. Can we not 
recognize surrogate pairs in character classes within a /u regexp and 
interpret them as code points?


With /u, that's exactly what happens. My first proposal was to make this 
happen even without a new flag, i.e., make

"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of 
compatibility risk. My proposal also includes some transformations to keep 
existing regular expressions working, and Steve correctly observes that if 
we have a flag for code point mode, then the transformation is not 
needed - old regular expressions would continue to work in code unit mode, 
while new regular expressions with /u get code point treatment.




Re: Full Unicode based on UTF-16 proposal

2012-03-25 Thread Roger Andrews

Just confirmed C/C++ do allow \U escaped characters for non-BMP code 
points in string literals.


Interesting page at:
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm

So C/C++ has:
   \xNN         8-bit character (U+0000 - U+00FF)
   \uNNNN       16-bit character
   \UNNNNNNNN   32-bit character

This naturally expresses any character, without worrying about the UTF-16 or 
whatever encoding.


--
From: Roger Andrews
To: Norbert Lindenberg


Doesn't C/C++ allow non-BMP code points using \U in character 
literals.  The \U format expresses a full 32-bit code, which could be 
mapped internally to two 16-bit UTF-16 codes.


Then the programmer can describe exactly the required characters without 
caring about their coding in UTF-16 or whatever.


Could you use this to avoid complicated things in RegExps like 
[{\u\u}-{\u\u}], instead have things like 
[\U0001-\U0003] -- naturally expressing the characters of 
interest?


The same goes for String literals, where the programmer does not really 
care about the encoding, just specifying the character.


(Sorry if I've missed something in the prior discussion.)

--
From: Norbert Lindenberg
To: David Herman


On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip]

As for whether the switch to code-point-based matching should be 
universal or require /u (an issue that your proposal leaves open), IMHO 
it's better to require /u since it avoids the need for transforming 
\u[\u-\u] to [{\u\u}-{\u\u}] and 
[\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and 
additionally avoids at least three potentially breaking changes (two of 
which are explicitly mentioned in your proposal):


I haven't completely understood this part of the discussion. Looking at 
/u as a little red switch (LRS), i.e., an opportunity to make 
judicious breaks with compatibility, could we not allow character 
classes with unescaped non-BMP code points, e.g.:


   js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
   ["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm 
guessing that I'm missing a reason why that wouldn't work... Presumably 
the JS source of the regexp literal, as a sequence of UTF-16 code units, 
represents the tetragram code points as surrogate pairs. Can we not 
recognize surrogate pairs in character classes within a /u regexp and 
interpret them as code points?


With /u, that's exactly what happens. My first proposal was to make this 
happen even without a new flag, i.e., make

"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of 
compatibility risk. My proposal also includes some transformations to 
keep existing regular expressions working, and Steve correctly observes 
that if we have a flag for code point mode, then the transformation is 
not needed - old regular expressions would continue to work in code unit 
mode, while new regular expressions with /u get code point treatment.
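
On an engine that implements the /u flag as ECMAScript 2015 later standardized it, the difference between the two modes can be sketched as follows. The variable names are only for illustration:

```javascript
// U+1D306..U+1D30A: five consecutive tetragrams, each a surrogate pair.
const tetragrams = String.fromCodePoint(0x1D306, 0x1D307, 0x1D308, 0x1D309, 0x1D30A);

// Code point mode: the class is a range of code points.
const match = tetragrams.match(/[\u{1D306}-\u{1D356}]+/u);
console.log(match[0] === tetragrams); // true

// Code unit mode: "." sees one 16-bit code unit at a time.
console.log(/^.$/.test("\u{1D306}"));  // false, the string is two code units
console.log(/^.$/u.test("\u{1D306}")); // true, it is one code point
```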



___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread David Herman

On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:

 Concerning UTF-16 surrogate pairs, how about a function like:
  String.isValid( str )
 to discover whether surrogates are used correctly in 'str'?
 
 Something like Array.isArray().

No need for it to be a class method, since it only operates on strings. We 
could simply have String.prototype.isValid(). Note that it would work for 
primitive strings as well, thanks to JS's automatic promotion semantics.
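
A minimal sketch of such a check, written as a free function rather than a prototype method. The name isWellFormedUTF16 is made up for this example; the thread only proposes isValid:

```javascript
// Returns true if every surrogate code unit in str is part of a
// high/low surrogate pair.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++; // skip the low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("abc"));          // true
console.log(isWellFormedUTF16("\uD834\uDF06")); // true
console.log(isWellFormedUTF16("\uD834"));       // false
console.log(isWellFormedUTF16("\uDF06\uD834")); // false
```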

Dave


Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread David Herman

On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:

 I've been wondering whether it might be best for the /u flag to do three 
 things at once, making it an all-around "support Unicode better" flag:

+all my internet points

Now you're talking!!

 1. Switches from code unit to code point mode. /./gu matches any Unicode code 
 point, among other benefits outlined by Norbert.
 
 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. 
 [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match 
 ASCII characters only while using /u.
 
 3. [New proposal] Makes /i use Unicode casefolding rules. 
 /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

This is really exciting.

 As for whether the switch to code-point-based matching should be universal or 
 require /u (an issue that your proposal leaves open), IMHO it's better to 
 require /u since it avoids the need for transforming \u[\u-\u] to 
 [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to 
 [{\u\uDC00}-{\u\uDFFF}], and additionally avoids at least three 
 potentially breaking changes (two of which are explicitly mentioned in your 
 proposal):

I haven't completely understood this part of the discussion. Looking at /u as a 
little red switch (LRS), i.e., an opportunity to make judicious breaks with 
compatibility, could we not allow character classes with unescaped non-BMP code 
points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm 
guessing that I'm missing a reason why that wouldn't work... Presumably the JS 
source, as a sequence of UTF-16 code units, represents the tetragram code 
points as surrogate pairs. Can we not recognize surrogate pairs in character 
classes within a /u regexp and interpret them as code points?

Dave


Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread David Herman

 Presumably the JS source, as a sequence of UTF-16 code units, represents the 
 tetragram code points as surrogate pairs.

Clarification: the JS source *of the regexp literal*.

Dave


Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread Wes Garland

On 24 March 2012 17:22, David Herman dher...@mozilla.com wrote:

 I'm not 100% clear on this point yet, but e.g. the SourceCharacter
 production in Annex A.1 is described as any Unicode code unit.


Ugh, IMHO, that's wrong, and should be any Unicode code point.  (let the
flames begin?)

 The underlying transport format should not be a concern for the JS lexer.


 eval


Eval is a red herring: its input is defined as the contents of the given
String.  So, we come full-circle back to "what's in a String?".   I'm still
partial to Brendan's BRS idea, because at least it fixes everything all at
once.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread Norbert Lindenberg


On Mar 23, 2012, at 6:30 , Steven Levithan wrote:

 Norbert Lindenberg wrote:
 
 I've updated the proposal based on the feedback received so far. Changes
 are listed in the Updates section.
 http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
 
 Cool.
 
 From the proposal's Updates section:
 
 Indicated that u may not be the actual character for the flag for code
 point mode in regular expressions, as a u flag has already been proposed
 for Unicode-aware digit and word character matching.
 
 I've been wondering whether it might be best for the /u flag to do three 
 things at once, making it an all-around "support Unicode better" flag:
 
 1. Switches from code unit to code point mode. /./gu matches any Unicode code 
 point, among other benefits outlined by Norbert.
 
 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. 
 [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match 
 ASCII characters only while using /u.

One concern: I think code point based matching should be the default for regex 
literals within modules (where we know the code is written for Harmony). Does 
it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such 
literals?

In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply 
code point mode.

 3. [New proposal] Makes /i use Unicode casefolding rules. 
 /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18, Unicode 
Regular Expressions, and see how we can upgrade RegExp for better Unicode 
support. Maybe on a separate thread...

 Item number 3 is inspired by but different than Java's lowercase u flag for 
 Unicode casefolding. In Java, flag u itself enables Unicode casefolding and 
 does not need to be paired with flag i (which is equivalent to ES's /i).
 
 As an aside, merging these three things would likely lead to /u seeing 
 widespread use when dealing with anything more than ASCII, at least in 
 environments where you don't have to worry about backcompat. This would help 
 developers avoid stumbling on code unit issues in the small minority of cases 
 where non-BMP characters are used or encountered. If /u's only purpose was to 
 switch to code point mode, most likely it would be used *far* less often, and 
 more developers would continue to get bitten by code-unit-based processing.

Good thinking :-)

 As for whether the switch to code-point-based matching should be universal or 
 require /u (an issue that your proposal leaves open), IMHO it's better to 
 require /u since it avoids the need for transforming \u[\u-\u] to 
 [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to 
 [{\u\uDC00}-{\u\uDFFF}], and additionally avoids at least three 
 potentially breaking changes (two of which are explicitly mentioned in your 
 proposal):
 
 1. [S]ome applications might have processed gunk with regular expressions 
 where neither the 'characters' in the patterns nor the input to be matched 
 are text.
 
 2. s.match(/^.$/)[0].length can now be 2.
 I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.
 
 3. /./g.exec(s) can now increment the regex's lastIndex by 2.
 
 -- Steven Levithan


Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread Norbert Lindenberg


On Mar 23, 2012, at 7:12 , Lasse Reichstein wrote:

 On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan
 steves_l...@hotmail.com wrote:
 I've been wondering whether it might be best for the /u flag to do three
 things at once, making it an all-around "support Unicode better" flag:
 
 ...
 
 3. [New proposal] Makes /i use Unicode casefolding rules.
 
 Yey, I'm for it :)
 Especially if it means dropping the rather naïve canonicalize function
 that can't canonicalize an ASCII character with a non-ASCII character.
 
 /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
 
 I think a compliant implementation should (read: ought to) already get
 that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase()
 in the browsers I have checked, and the ignore-case canonicalization
 is based on toUpperCase. Alas, most of the implementations miss it
 anyway.

According to the ES5 spec, /ΣΤΙΓΜΑΣ/i.test("στιγμας") must be true indeed. 
Chrome and Node (i.e., V8) and IE get this right; Safari, Firefox, and Opera 
don't.

Note that toUpperCase allows mappings from 1 to multiple code units, while 
RegExp canonicalization in ES5 doesn't, so /SS/i.test("ß") === false even 
though "SS".toUpperCase() === "ß".toUpperCase().
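
The asymmetry can be checked directly. A small sketch, assuming an engine that follows the ES5 canonicalization rules described above (as noted, some 2012 engines did not):

```javascript
// toUpperCase may map one code unit to several...
console.log("ß".toUpperCase());                        // "SS"
console.log("SS".toUpperCase() === "ß".toUpperCase()); // true

// ...but RegExp case-insensitive matching canonicalizes one code unit
// at a time, so the one-to-many mapping is ignored.
console.log(/SS/i.test("ß")); // false

// Final sigma and sigma both upper-case to capital sigma, one to one,
// so ES5 canonicalization makes the two spellings match.
const lower = "στιγμας";
const upper = lower.toUpperCase();
console.log(new RegExp(upper, "i").test(lower)); // true on a conforming engine
```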

Norbert


Re: Full Unicode based on UTF-16 proposal

2012-03-24 Thread Norbert Lindenberg

Thanks for the detailed comments! Replies below.

Norbert


On Mar 23, 2012, at 9:46 , Phillips, Addison wrote:

 Comments follow.
 
 1. Definition of string. You say:
 
 --
 However,
ECMAScript does not place any restrictions or requirements on the sequence
of code units in a String value, so it may be ill-formed when interpreted
as a UTF-16 code unit sequence.
 --
 
 I know what you mean, but others might not. Perhaps:
 
 --
 However, ECMAScript does not place any restrictions or requirements on the 
 sequence of code units in a String value, so the sequence of code units may 
 contain code units that are not valid in Unicode or sequences that do not 
 represent Unicode code points (such as unpaired surrogates).
 --

I can add a note that ill-formed here means containing unpaired surrogates. If 
I read chapter 3 of the Unicode Standard correctly, there's no other way for 
UTF-16 to be ill-formed. UTF-16 code units by themselves cannot be invalid - 
any 16-bit value can occur in a well-formed UTF-16 string.
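
The distinction between code units and code points, and the treatment of an unpaired surrogate as a code point with the same value, can be illustrated with the code point APIs that ECMAScript 2015 later added. This is a sketch for illustration, not text from the proposal:

```javascript
const paired = "\uD834\uDF06"; // well-formed: one supplementary code point
const lone = "a\uD834b";       // ill-formed: unpaired high surrogate

console.log(paired.length);                      // 2 code units
console.log(Array.from(paired).length);          // 1 code point
console.log(paired.codePointAt(0).toString(16)); // 1d306

// The unpaired surrogate surfaces as a code point of the same value.
console.log(Array.from(lone).length);            // 3
console.log(lone.codePointAt(1).toString(16));   // d834
```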

 2. In this section, I would define string after code unit and code point. I 
 would also include a definition of surrogates/surrogate pairs.

Makes sense.

 3. Under text interpretation you say:
 
 --
 For compatibility with existing applications, it
  has to allow surrogate code points (code points between U+D800 and U+DFFF 
 which
  can never represent characters).
 --
 
 This would (see above) benefit from having a definition in place. As noted, 
 this is slightly incomplete, since surrogate code units are used to form 
 supplementary characters.

The text is about surrogate code points, not about surrogate code units.

 4. 0xFFFE and 0x are non-characters in Unicode. I do think you do the 
 right thing here. It's just a nit that you never note this ;-).
 
 5. Editorial unnecessary ;-):
 
 --
 This transformation is rather ugly, but I’m afraid it’s the price ECMAScript
  has to pay for being 12 years late in supporting supplementary characters.
 --
 
 6. Under 'details' you suggest a number of renamings. Are these strictly 
 necessary? The term 'character' could be taken to mean 'code point' instead, 
 with an explanatory note.

Unfortunately, the term character is poisoned in ES5 by a redefinition as 
code unit (chapter 6). For ES6, I'd like the spec to be really clear where it 
means code units and where it means code points. Maybe we can then reintroduce 
character in ES7...

 7. Skipping down a lot, to section 6 source text, you propose:
 
 --
 The text is expected to have been normalised
to Unicode Normalization Form C (Canonical Decomposition, followed by 
 Canonical
Composition), as described in Unicode Standard Annex #15.
 --
 
 I think this should be removed or modified.

This sentence is essentially copied from ES5 (with corrected references), and 
as I copied it, I made a note to myself that we need to discuss normalization, 
just not as part of this proposal...

 Automatic application of NFC is not always desirable, as it can affect 
 presentation or processing. Perhaps:
 
 --
 Normalization of the text to Unicode Normalization Form C (Canonical 
 Decomposition, followed by Canonical Composition), as described in Unicode 
 Standard Annex #15, is recommended when transcoding from another character 
 encoding.
 --
 
 8. In 7.6 Identifier Names and Identifiers you don't actually forbid 
 unpaired surrogates or non-characters in the text (Identifier_Part:: does 
 this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as 
 the last character in an identifier.

I can add a note about surrogate code points and non-characters, but, as you 
say, they are already ruled out because they can't have the required Unicode 
properties ID_Start or ID_Continue.
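
With the Unicode property escapes that ECMAScript 2018 later added to RegExp, this exclusion can be observed directly. A sketch; isIdStart is a hypothetical helper, and real identifier parsing additionally allows $, _ and escape sequences:

```javascript
// True if the string is exactly one code point with the ID_Start property.
const isIdStart = (s) => /^\p{ID_Start}$/u.test(s);

console.log(isIdStart("a"));         // true
console.log(isIdStart("\u{10400}")); // true, a supplementary letter (Deseret)
console.log(isIdStart("\uD800"));    // false, unpaired surrogate
console.log(isIdStart("\uFFFD"));    // false, replacement character
console.log(isIdStart("\uFFFE"));    // false, non-character
```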

The use of ZWJ and ZWNJ is unchanged from ES5. UAX 31 has much stricter rules 
on where they would be allowed, but I'm not sure we have a strong case for 
changing the rules in ECMAScript.
http://www.unicode.org/reports/tr31/tr31-9.html#Layout_and_Format_Control_Characters

 9. 15.5.4.6: you say (a nonnegative integer less than 0x10), whereas 
 it should say: (a nonnegative integer less than or equal to 0x10)

Will fix.

 10. In the section on what about utf-32, you say:  and the code points 
 start at positions 1, 2, 3.. Of course this should be ... and the code 
 points start at positions 0, 1, 2.

Of course.

 Thanks for this proposal!
 
 Addison
 
 -Original Message-
 From: Norbert Lindenberg [mailto:ecmascr...@norbertlindenberg.com]
 Sent: Thursday, March 22, 2012 10:14 PM
 To: es-discuss@mozilla.org
 Subject: Re: Full Unicode based on UTF-16 proposal
 
 I've updated the proposal based on the feedback received so far. Changes are
 listed in the Updates section.
 http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
 
 Norbert
 
 
 On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote:
 
 Based on my prioritization of goals for support for full Unicode

Re: Full Unicode based on UTF-16 proposal

2012-03-23 Thread Steven Levithan


Norbert Lindenberg wrote:


I've updated the proposal based on the feedback received so far. Changes
are listed in the Updates section.
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/


Cool.

From the proposal's Updates section:


Indicated that u may not be the actual character for the flag for code
point mode in regular expressions, as a u flag has already been proposed
for Unicode-aware digit and word character matching.


I've been wondering whether it might be best for the /u flag to do three 
things at once, making it an all-around "support Unicode better" flag:


1. Switches from code unit to code point mode. /./gu matches any Unicode 
code point, among other benefits outlined by Norbert.


2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. 
[0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match 
ASCII characters only while using /u.


3. [New proposal] Makes /i use Unicode casefolding rules. 
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.


Item number 3 is inspired by but different than Java's lowercase u flag for 
Unicode casefolding. In Java, flag u itself enables Unicode casefolding and 
does not need to be paired with flag i (which is equivalent to ES's /i).


As an aside, merging these three things would likely lead to /u seeing 
widespread use when dealing with anything more than ASCII, at least in 
environments where you don't have to worry about backcompat. This would help 
developers avoid stumbling on code unit issues in the small minority of 
cases where non-BMP characters are used or encountered. If /u's only purpose 
was to switch to code point mode, most likely it would be used *far* less 
often, and more developers would continue to get bitten by code-unit-based 
processing.


As for whether the switch to code-point-based matching should be universal 
or require /u (an issue that your proposal leaves open), IMHO it's better to 
require /u since it avoids the need for transforming \u[\u-\u] 
to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to 
[{\u\uDC00}-{\u\uDFFF}], and additionally avoids at least three 
potentially breaking changes (two of which are explicitly mentioned in your 
proposal):


1. [S]ome applications might have processed gunk with regular expressions 
where neither the 'characters' in the patterns nor the input to be matched 
are text.


2. s.match(/^.$/)[0].length can now be 2.
I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

3. /./g.exec(s) can now increment the regex's lastIndex by 2.
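
These effects are easy to reproduce with the /u flag as ECMAScript 2015 later standardized it. A sketch with a made-up input string:

```javascript
const s = "\u{1D306}"; // one supplementary code point, two code units

// 2. A single "." match can have length 2 in code point mode.
console.log(s.match(/^.$/u)[0].length); // 2
console.log(s.match(/^.$/));            // null in code unit mode

// Three code points can span anywhere from 3 to 6 code units.
console.log(/.{3}/u.exec("a" + s + "b")[0].length); // 4
console.log(/.{3}/u.exec(s + s + s)[0].length);     // 6

// 3. lastIndex advances by 2 after matching a supplementary code point.
const re = /./gu;
re.exec(s + "x");
console.log(re.lastIndex); // 2
```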

-- Steven Levithan



Re: Full Unicode based on UTF-16 proposal

2012-03-23 Thread Lasse Reichstein

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan
steves_l...@hotmail.com wrote:
 I've been wondering whether it might be best for the /u flag to do three
 things at once, making it an all-around "support Unicode better" flag:

...

 3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :)
Especially if it means dropping the rather naïve canonicalize function
that can't canonicalize an ASCII character with a non-ASCII character.

 /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get
that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase()
in the browsers I have checked, and the ignore-case canonicalization
is based on toUpperCase. Alas, most of the implementations miss it
anyway.

/L

RE: Full Unicode based on UTF-16 proposal

2012-03-23 Thread Phillips, Addison

Comments follow.

1. Definition of string. You say:

--
However,
ECMAScript does not place any restrictions or requirements on the sequence
of code units in a String value, so it may be ill-formed when interpreted
as a UTF-16 code unit sequence.
--

I know what you mean, but others might not. Perhaps:

--
However, ECMAScript does not place any restrictions or requirements on the 
sequence of code units in a String value, so the sequence of code units may 
contain code units that are not valid in Unicode or sequences that do not 
represent Unicode code points (such as unpaired surrogates).
--

2. In this section, I would define string after code unit and code point. I 
would also include a definition of surrogates/surrogate pairs.

3. Under text interpretation you say:

--
For compatibility with existing applications, it
  has to allow surrogate code points (code points between U+D800 and U+DFFF 
which
  can never represent characters).
--

This would (see above) benefit from having a definition in place. As noted, 
this is slightly incomplete, since surrogate code units are used to form 
supplementary characters. Perhaps:

--
For compatibility with existing applications, it has to allow surrogate code 
points (code points between U+D800 and U+DFFF which do not individually 
represent characters).
--

4. 0xFFFE and 0x are non-characters in Unicode. I do think you do the right 
thing here. It's just a nit that you never note this ;-).

5. Editorial unnecessary ;-):

--
This transformation is rather ugly, but I’m afraid it’s the price ECMAScript
  has to pay for being 12 years late in supporting supplementary characters.
--

6. Under 'details' you suggest a number of renamings. Are these strictly 
necessary? The term 'character' could be taken to mean 'code point' instead, 
with an explanatory note.

7. Skipping down a lot, to section 6 source text, you propose:

--
The text is expected to have been normalised
to Unicode Normalization Form C (Canonical Decomposition, followed by 
Canonical
Composition), as described in Unicode Standard Annex #15.
--

I think this should be removed or modified. Automatic application of NFC is not 
always desirable, as it can affect presentation or processing. Perhaps:

--
Normalization of the text to Unicode Normalization Form C (Canonical 
Decomposition, followed by Canonical Composition), as described in Unicode 
Standard Annex #15, is recommended when transcoding from another character 
encoding.
--
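
To illustrate what NFC changes, here is a sketch using String.prototype.normalize, which ECMAScript 2015 later added; it is not part of the proposal under discussion:

```javascript
const decomposed = "e\u0301"; // "e" followed by a combining acute accent
const composed = "\u00E9";    // precomposed e with acute

console.log(decomposed === composed);                  // false
console.log(decomposed.normalize("NFC") === composed); // true
console.log(decomposed.length, decomposed.normalize("NFC").length); // 2 1

// NFC can also replace one character with another: ANGSTROM SIGN
// (U+212B) becomes LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5).
console.log("\u212B".normalize("NFC") === "\u00C5");   // true
```

The last line shows why applying NFC automatically is not always harmless: the code points in the text change even though the rendering usually does not.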

8. In 7.6 Identifier Names and Identifiers you don't actually forbid unpaired 
surrogates or non-characters in the text (Identifier_Part:: does this by 
implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last 
character in an identifier.

9. 15.5.4.6: you say (a nonnegative integer less than 0x10), whereas it 
should say: (a nonnegative integer less than or equal to 0x10)

10. In the section on what about utf-32, you say:  and the code points start 
at positions 1, 2, 3.. Of course this should be ... and the code points start 
at positions 0, 1, 2.

Thanks for this proposal!

Addison

 -Original Message-
 From: Norbert Lindenberg [mailto:ecmascr...@norbertlindenberg.com]
 Sent: Thursday, March 22, 2012 10:14 PM
 To: es-discuss@mozilla.org
 Subject: Re: Full Unicode based on UTF-16 proposal
 
 I've updated the proposal based on the feedback received so far. Changes are
 listed in the Updates section.
 http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
 
 Norbert
 
 
 On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote:
 
  Based on my prioritization of goals for support for full Unicode in 
  ECMAScript
 [1], I've put together a proposal for supporting the full Unicode character 
 set
 based on the existing representation of text in ECMAScript using UTF-16 code
 unit sequences:
  http://norbertlindenberg.com/2012/03/ecmascript-supplementary-
 characters/index.html
 
  The detailed proposed spec changes serve to get a good idea of the scope of
 the changes, but will need some polishing.
 
  Comments?
 
  Thanks,
  Norbert
 
  [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
 
 


Re: Full Unicode based on UTF-16 proposal

2012-03-23 Thread Roger Andrews


Concerning UTF-16 surrogate pairs, how about a function like:
  String.isValid( str )
to discover whether surrogates are used correctly in 'str'?

Something like Array.isArray().

N.B.  encodeURI already throws a URIError exception if 'str' is not a 
well-formed UTF-16 string.
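
The encodeURI behaviour can serve as a rough well-formedness probe. A sketch, with looksWellFormed as a hypothetical name:

```javascript
// encodeURI throws URIError when it meets an unpaired surrogate,
// so catching that error tells us the string is ill-formed UTF-16.
function looksWellFormed(str) {
  try {
    encodeURI(str);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(looksWellFormed("\uD834\uDF06")); // true
console.log(looksWellFormed("\uD834"));       // false
```

This does more work than necessary (it builds the encoded string and discards it), which is one argument for a dedicated function.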


-

1. Definition of string. You say:

--
However,
   ECMAScript does not place any restrictions or requirements on the
   sequence of code units in a String value, so it may be ill-formed when
   interpreted as a UTF-16 code unit sequence.
--

I know what you mean, but others might not. Perhaps:

--
However, ECMAScript does not place any restrictions or requirements on the
sequence of code units in a String value, so the sequence of code units
may contain code units that are not valid in Unicode or sequences that do
not represent Unicode code points (such as unpaired surrogates).
--




Re: Full Unicode based on UTF-16 proposal

2012-03-22 Thread Norbert Lindenberg

I've updated the proposal based on the feedback received so far. Changes are 
listed in the Updates section.
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/

Norbert


On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote:

 Based on my prioritization of goals for support for full Unicode in 
 ECMAScript [1], I've put together a proposal for supporting the full Unicode 
 character set based on the existing representation of text in ECMAScript 
 using UTF-16 code unit sequences:
 http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
 
 The detailed proposed spec changes serve to get a good idea of the scope of 
 the changes, but will need some polishing.
 
 Comments?
 
 Thanks,
 Norbert
 
 [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
 


Re: Full Unicode based on UTF-16 proposal

2012-03-19 Thread Steven L.


Steven Levithan wrote:
\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for 
[[:alnum:]], for compatibility reasons, would probably be 
[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive 
(if you like that exact set) or a negative (many users will think it's 
equivalent to \w with Unicode even though it isn't).


Although some regex libraries indeed implement the above, I've just looked 
over UTS#18 Annex C [1], which requires that \w be equivalent to:


[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear 
on whether the differences from \p{L} are fully covered by the inclusion of 
\p{M} in the above character class. I'm sure there are plenty of people here 
with greater Unicode expertise than me who could clarify, though.
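
For experimentation, the UTS#18 definition quoted above can be written out with the Unicode property escapes that ECMAScript 2018 later added. A sketch only; the name unicodeWord is made up, and none of this syntax existed in ES when this thread was written:

```javascript
// \w as quoted from UTS #18 Annex C in the message above.
const unicodeWord = /^[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]+$/u;

console.log(unicodeWord.test("naïve_café")); // true
console.log(unicodeWord.test("\u0663"));     // true, ARABIC-INDIC DIGIT THREE (Nd)
console.log(unicodeWord.test("e\u0301"));    // true, the combining accent is in \p{M}
console.log(unicodeWord.test("a b"));        // false, space is not a word character

// ES as later standardized kept \w ASCII-only, even with /u.
console.log(/^\w+$/u.test("café"));          // false
```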


-- Steven Levithan

[1]: http://unicode.org/reports/tr18/#Compatibility_Properties


Re: Full Unicode based on UTF-16 proposal

2012-03-19 Thread Steven L.

Java SE 7 apparently added flag (?U) to do the same thing as Python's (?u). 
The new flag also affects Java's POSIX character class definitions such as 
\p{Alnum}.


Note the difference in casing, and also that Java's (?U)\w follows UTS#18, 
unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for 
Unicode-aware case folding.


-- Steven Levithan


-Original Message- 
From: Steven L.

Sent: Monday, March 19, 2012 12:21 PM
To: Erik Corry
Cc: es-discuss@mozilla.org
Subject: Re: Full Unicode based on UTF-16 proposal

Steven Levithan wrote:
\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for 
[[:alnum:]], for compatibility reasons, would probably be 
[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive 
(if you like that exact set) or a negative (many users will think it's 
equivalent to \w with Unicode even though it isn't).


Although some regex libraries indeed implement the above, I've just looked
over UTS#18 Annex C [1], which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear
on whether the differences from \p{L} are fully covered by the inclusion of
\p{M} in the above character class. I'm sure there are plenty of people here
with greater Unicode expertise than me who could clarify, though.

-- Steven Levithan

[1]: http://unicode.org/reports/tr18/#Compatibility_Properties


Re: Full Unicode based on UTF-16 proposal

2012-03-18 Thread Steven L.


Steven Levithan wrote:

* \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).


Oops. My ASCII-only version of \s is obviously missing space \x20 and 
no-break space \xA0 (which are included in Unicode's \p{Z}).
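
A quick sketch of the difference in JavaScript itself, where \s already covers Unicode space separators. The name asciiSpace is only for illustration:

```javascript
// ASCII-only whitespace: the control range plus \x20.
const asciiSpace = /^[\x09-\x0D\x20]$/;
console.log(asciiSpace.test(" "));    // true
console.log(asciiSpace.test("\xA0")); // false, no-break space is not ASCII

// ECMAScript's \s also matches no-break space and other space separators.
console.log(/^\s$/.test("\xA0"));   // true
console.log(/^\s$/.test("\u2003")); // true, EM SPACE
```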


Erik Corry wrote:

Steven Levithan wrote:

[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only
[A-Za-z0-9]. Making it Unicode-based in ES would be confusing.


This would be pretty useless and is not true in perl.  I tried the 
following:


perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . \"\\n\";"

and it prints 1, indicating a match.


***Updating my mental notes*** Roger that. Online docs (including the 
Perl-specific page you linked to earlier) typically list [:alnum:] as 
[A-Za-z0-9], but I've just done some quick testing and it seems that regex 
packages supporting [:alnum:] give it at least three different meanings:


* [A-Za-z0-9]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}\p{Nl}]

Note that although Java doesn't support POSIX character class syntax, it too 
supports alnum via \p{Alnum}. Java's alnum matches only [A-Za-z0-9].


Anyway, this is probably all moot, unless someone wants to officially 
propose POSIX character classes for ES RegExp. ...In which case I'll be 
happy to state about a half-dozen reasons to not do so. :)


Erik Corry wrote:

OK, I'm convinced that /u should make \d, \b and \w Unicode aware.


w00t!

--Steven Levithan



Re: Full Unicode based on UTF-16 proposal

2012-03-18 Thread Erik Corry

2012/3/18 Steven L. steves_l...@hotmail.com:
 Anyway, this is probably all moot, unless someone wants to officially
 propose POSIX character classes for ES RegExp. ...In which case I'll be
 happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

In fact \w with Unicode support seems very similar to [:alnum:] to me.
 If this one is useful are there not other Unicode categories that
would be useful?

-- 
Erik Corry

Re: Full Unicode based on UTF-16 proposal

2012-03-18 Thread Steven L.


Erik Corry wrote:

Steven Levithan wrote:

Anyway, this is probably all moot, unless someone wants to officially
propose POSIX character classes for ES RegExp. ...In which case I'll be
happy to state about a half-dozen reasons to not do so. :)


Please do, they seem quite sensible to me.


My main objections are due to the POSIX character class syntax itself, and 
my preference for introducing Unicode categories using \p{..} instead. But 
to get down a little more detail...


* They're backward incompatible. /[[:name:]]/ is currently equivalent to 
/[\[:aemn]\]/ in web-reality. Granted, this probably won't be a big deal for 
existing code, but because they're not currently an error, their use could 
cause latent bugs in old browsers that don't support them and treat them as 
part of a character class's set.


* They work inside of bracket expressions only. This is clumsy and 
needlessly confusing. [:alnum:] outside of a bracket expression would 
probably have to continue to be equivalent to [:almnu], which would lead to 
at least occasional developer frustration and bugs.


* Since the exact characters they match differs between regex libraries 
(beyond just Unicode version variation), they would contribute to the 
existing landscape of regex features that seem to be portable but actually 
work slightly differently in different places. We need less of this.


* They are either rarely useful or only minor conveniences over existing 
shorthands, explicit character classes, or Unicode categories that could be 
matched using \p{..} in more standardized fashion.


* Other implementations, at least, do not allow them to be negated on their 
own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using 
them in negated bracket expressions, but that may negate more than you want.


* If ES ever adopts .NET/XPath-style character class subtraction or 
Java-style character class intersection (the latter was on the cards for 
ES4), their syntax would become even more confusing.


* Bonus pompous bullet point: IMO, there are more useful and important new 
RegExp features to focus on, including support for Unicode categories 
(which, IMO, are regex's new and improved version of POSIX character 
classes). My personal wishlist would probably include at least 20 new regex 
features above POSIX character classes, even if they were introduced using 
the \p{..} syntax (which is how Java included them).


* Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls 
them character classes, and calls their container a bracket expression. 
JavaScripters already call the container a character class. (Not an 
objection, per se. Presumably we could call them something like POSIX 
shorthands to avoid confusion.)


I'd have no actual objections to adding them using the \p{Name} syntax (as 
Java does), especially if there is demand for them among regex power-users 
(you're the first person who I've seen strongly advocate for them). However, 
I'd still have concerns about exactly which names are added, exactly what 
they match, and their compatibility with other regex flavors.
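
To make the backward-compatibility concern above concrete, here is a minimal sketch. It assumes a current engine: the first half shows how POSIX-looking syntax already parses in non-Unicode-mode ECMAScript regexes, and the second half uses the `\p{..}`/`\P{..}` property escapes that later ECMAScript editions (ES2018, with the `u` flag) adopted, which postdate this thread. All variable names are illustrative.

```javascript
// In non-Unicode mode this is NOT a POSIX class: it is the character class
// containing "[", ":", "a", "l", "n", "u", "m" (written "[[:alnum:]"),
// followed by a literal "]".
const posixLookalike = /^[[:alnum:]]$/;

const lookalikeResults = {
  letterThenBracket: posixLookalike.test('a]'), // "a" is in the set, then "]"
  colonThenBracket: posixLookalike.test(':]'),  // ":" is in the set, then "]"
  digitAlone: posixLookalike.test('7'),         // "7" is not in the set
};

// \p{..} can be negated on its own with \P{..} (requires the u flag).
const isLetter = /^\p{L}$/u;
const isNotLetter = /^\P{L}$/u;

const negationResults = {
  letterMatchesP: isLetter.test('\u00E9'),           // U+00E9 is a letter
  letterMatchesNegatedP: isNotLetter.test('\u00E9'),
  digitMatchesNegatedP: isNotLetter.test('7'),
};

console.log(lookalikeResults, negationResults);
```

Any existing code that relies on the first behaviour would change meaning if `[[:name:]]` were given POSIX semantics.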



In fact \w with Unicode support seems very similar to [:alnum:] to me.
 If this one is useful are there not other Unicode categories that
would be useful?


\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for 
[[:alnum:]], for compatibility reasons, would probably be 
[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive 
(if you like that exact set) or a negative (many users will think it's 
equivalent to \w with Unicode even though it isn't).
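
The difference between the two sets can be sketched as follows. This assumes an engine with Unicode property escapes (ES2018 and later, `u` flag), which did not exist when this was written; the names `unicodeWord`, `casedAlnum`, and `classify` are made up for illustration.

```javascript
// Requires Unicode property escapes (ES2018+).
const unicodeWord = /^[\p{L}\p{Nd}_]$/u;            // the Unicode \w set given above
const casedAlnum = /^[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]$/u; // the suggested [[:alnum:]] set

// Report which of the two sets a single character belongs to.
function classify(ch) {
  return { word: unicodeWord.test(ch), alnum: casedAlnum.test(ch) };
}

console.log(classify('a'));      // cased letter
console.log(classify('_'));      // underscore
console.log(classify('\u05D0')); // Hebrew alef, general category Lo
console.log(classify('\u0663')); // Arabic-Indic digit three, category Nd
```

The underscore and uncased letters (category Lo, such as Hebrew alef) are in the first set but not the second, which is where users expecting the two to be equivalent would be surprised.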


As you said, though, Unicode categories are indeed quite useful. Unicode 
scripts, too. I'd advocate for them alongside you. Because of how useful 
they are, I've even made them usable via my XRegExp JavaScript library (see 
http://git.io/xregexp ). That lib has a relatively small but enthusiastic 
user base and is seeing increasing use in server-side JS, where the overhead 
of loading long Unicode code point ranges doesn't matter as much. But, so 
long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue 
that even Unicode categories and scripts are less important than various 
other features I've mentioned recently on es-discuss, including named 
capture and atomic groups.


-- Steven Levithan


___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Steven L.


Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to
control the new backward-incompatible behaviour.  There may be some
way to relax this for regexp literals in opted in Harmony code, but
for new RegExp(...) and for other string literals I think there are
rather too many inconsistencies with the old behaviour.


Disagree with adding /u for this purpose and disagree with breaking backward 
compatibility to let `/./.exec(s)[0].length == 2`. Instead, if this is 
deemed an important enough issue, there are two ways to match any Unicode 
grapheme that match existing regex library precedent:


From Perl and PCRE:

\X

From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

\P{M}\p{M}*

Obviously \X is prettier, but because it's fairly rare for people to care 
about this, IMO the more widely compatible solution that uses Unicode 
categories is Good Enough if Unicode category syntax is on the table for 
ES6.
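
As a rough sketch of the category-based pattern, assuming an engine with the Unicode property escapes that were eventually added in ES2018 (the helper name `splitBaseWithMarks` is hypothetical):

```javascript
// One non-mark code point followed by any number of combining marks.
// Requires Unicode property escapes (ES2018+, u flag).
const baseWithMarks = /\P{M}\p{M}*/gu;

// Split text into "base character plus its combining marks" chunks.
function splitBaseWithMarks(text) {
  return text.match(baseWithMarks) || [];
}

// "e" + U+0301 COMBINING ACUTE ACCENT, followed by "a".
const parts = splitBaseWithMarks('e\u0301a');
console.log(parts.length, parts.map(p => p.length));
```

This only approximates what `\X` matches; it does not implement the full grapheme cluster rules (for example Hangul syllable sequences).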


Norbert Lindenberg wrote:

\u[\u-\u] is interpreted as [\u\u-\u\u]
[\u-\u][\u-\u] is interpreted as 
[\u\u-\u\u]
This transformation is rather ugly, but I’m afraid it’s the price ECMAScript
has to pay for being 12 years late in supporting supplementary characters.


Yikes! -1! This is unnecessary if the handling of \u is unmodified and 
support for \u{h..} and/or \x{h..} is added (the latter is the syntax from 
Perl and PCRE). Some people will want a way to match arbitrary Unicode code 
points rather than graphemes anyway, so leaving \u alone lets that use 
case continue working. This would still allow modifying the handling of 
literal astral/supplementary characters in RegExps. If it can be handled 
sensibly, I'm all for treating literal characters in RegExps as discrete 
graphemes rather than splitting them into surrogate pairs.


--Steven Levithan


Re: Full Unicode based on UTF-16 proposal

2012/3/17 Steven L. steves_l...@hotmail.com:
 Erik Corry wrote:

 However I think we probably do want the /u modifier on regexps to
 control the new backward-incompatible behaviour.  There may be some
 way to relax this for regexp literals in opted in Harmony code, but
 for new RegExp(...) and for other string literals I think there are
 rather too many inconsistencies with the old behaviour.


 Disagree with adding /u for this purpose and disagree with breaking backward
 compatibility to let `/./.exec(s)[0].length == 2`.

Care to enlighten us with any thinking behind this disagreeing?

 Instead, if this is
 deemed an important enough issue, there are two ways to match any Unicode
 grapheme that match existing regex library precedent:

 From Perl and PCRE:

 \X

This doesn't work inside [].  Were you envisioning the same restriction in JS?

Also it matches a grapheme cluster, which may be useful but is
completely different to what the dot does.

 From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):

 \P{M}\p{M}*

 Obviously \X is prettier, but because it's fairly rare for people to care
 about this, IMO the more widely compatible solution that uses Unicode
 categories is Good Enough if Unicode category syntax is on the table for
 ES6.

 Norbert Lindenberg wrote:

 \u[\u-\u] is interpreted as [\u\u-\u\u]

Norbert, this just happens automatically if unmatched surrogates are
just treated as if they were normal code units.

 [\u-\u][\u-\u] is interpreted as
 [\u\u-\u\u]

Norbert, this will have different semantics to the current
implementations unless the second range is the full trail surrogate
range.

I agree with Steven that these two cases should just be left alone,
which means they will continue to work the way they have until now.

 Some people will want a way to match arbitrary Unicode code
 points rather than graphemes anyway, so leaving \u alone lets that use
 case continue working. This would still allow modifying the handling of
 literal astral/supplementary characters in RegExps. If it can be handled
 sensibly, I'm all for treating literal characters in RegExps as discrete
 graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and Unicode code points.  Here is
the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two graphemes:

  ¨  -- This is an attempt to show a Gothic Ahsa with an umlaut.
My mail program probably screwed it up.

The proposal you are responding to is all about adding Unicode code
point handling to regexps.  It is not about adding grapheme support,
which is a rather different issue.
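
The three views of that text can be checked with a short sketch. It assumes a current engine (code point iteration from ES2015, property escapes from ES2018), and the mark-based split is only an approximation of grapheme clusters; the variable names are illustrative.

```javascript
// The example text: SPACE, GOTHIC LETTER AHSA (as a surrogate pair),
// COMBINING DIAERESIS.
const text = '\u0020\uD800\uDF30\u0308';

// 1. UTF-16 code units: what String.prototype.length counts.
const codeUnitCount = text.length;

// 2. Unicode code points: string iteration combines surrogate pairs.
const codePoints = Array.from(text, ch => ch.codePointAt(0));

// 3. Approximate graphemes: a non-mark followed by its combining marks.
const clusters = text.match(/\P{M}\p{M}*/gu) || [];

console.log(codeUnitCount, codePoints.map(cp => cp.toString(16)), clusters.length);
```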

-- 
Erik Corry

Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Steven L.


Erik Corry wrote:
Disagree with adding /u for this purpose and disagree with breaking backward
compatibility to let `/./.exec(s)[0].length == 2`.


Care to enlighten us with any thinking behind this disagreeing?


Sorry for the rushed and overly ebullient message. I disagreed with /u for 
switching from code unit to code point mode because in the moment I didn't 
think a code point mode necessary or particularly beneficial. Upon further 
reflection, I rushed into this opinion and will be more closely examining 
the related issues.


I further objected because I think the /u flag would be better used as an 
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on 
Python's re.UNICODE or (?u) flag, which does the same thing except that it 
also covers \s (which is already Unicode-based in ES). Therefore, I think 
that if a flag is added that only switches from code unit to code point 
mode, it should not be u. Presumably, flag /u could simultaneously affect 
\d\w\b and switch to code point mode. I haven't yet thought enough about 
combining these two proposals to hold a strong opinion on the matter.
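
For comparison, the two roles discussed for the flag can be contrasted in a small sketch. It shows the behaviour of the `u` flag as it was later standardized in ES2015, which is an outcome not known in this thread: it switches to code point mode but leaves `\d` ASCII-only. The variable names are illustrative.

```javascript
// One supplementary character, stored as two UTF-16 code units.
const astral = '\uD800\uDF30';

// Code unit mode: "." consumes a single code unit, so it cannot span the pair.
const dotMatchesInCodeUnitMode = /^.$/.test(astral);

// Code point mode (ES2015 u flag): "." consumes the whole surrogate pair.
const dotMatchesInCodePointMode = /^.$/u.test(astral);
const matchedLength = /./u.exec(astral)[0].length;

// The standardized u flag does not make \d Unicode-aware.
const arabicIndicThree = '\u0663';
const digitMatchesUnderU = /^\d$/u.test(arabicIndicThree);

console.log(dotMatchesInCodeUnitMode, dotMatchesInCodePointMode,
            matchedLength, digitMatchesUnderU);
```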



there are two ways to match any Unicode
grapheme that match existing regex library precedent:

From Perl and PCRE:
\X


This doesn't work inside [].  Were you envisioning the same restriction in 
JS?


Also it matches a grapheme cluster, which may be useful but is
completely different to what the dot does.


You are of course correct. And yes, I was envisioning the same restriction 
within character classes. But I'm not a strong proponent of \X, especially 
if support for Unicode categories is added.



I agree with Steven that these two cases should just be left alone,
which means they will continue to work the way they have until now.


Glad to hear it.


You seem to be confusing graphemes and unicode code points.
[...]
The proposal you are responding to is all about adding Unicode code
point handling to regexps.  It is not about adding grapheme support,
which is a rather different issue.


Indeed. My response was rushed and poorly formed. My apologies.

--Steven Levithan


Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Norbert Lindenberg

Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note 
on this flag into my proposal. My proposal was inspired by the use of /u in 
PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether 
it makes sense to combine the two under one flag or use two - fortunately, 
Unicode still has a few other characters.

Norbert




Re: Full Unicode based on UTF-16 proposal

2012/3/17 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
 Steven, sorry, I wasn't aware of your proposal for /u when I inserted the 
 note on this flag into my proposal. My proposal was inspired by the use of /u 
 in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see 
 whether it makes sense to combine the two under one flag or use two - 
 fortunately, Unicode still has a few other characters.

/foo/☃   // slash-unicode-snowman for the win! :-)

-- 
Erik Corry

P.S. I shudder to think what slash-pile-of-poo could mean.

Re: Full Unicode based on UTF-16 proposal

2012/3/17 Steven L. steves_l...@hotmail.com:
 I further objected because I think the /u flag would be better used as an
 ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
 Python's re.UNICODE or (?u) flag, which does the same thing except that it
 also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this.  I think any digit
including rods and roman characters but not decimal points/commas
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is poor use of name space.  The \d escape
in perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode
numerals: 
http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
 This suggests to me that it's not very useful.

And instead of changing the meaning of \w, which will be confusing, I
think that [:alnum:] as in perl would work fine.

\b is a little tougher.  The Unicode rewrite would be
(?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
obviously too verbose.  But if we take \b for this then the ASCII
version has to be written as
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
annoying.  However, often you don't need that if you have negative
lookbehind because you can write something
like

/(?<!\w)word(?!\w)/  // Negative look-behind for a \w at the start and
                     // negative look-ahead for a \w at the end.

which isn't _too_ bad, even if it is much worse than

/\bword\b/
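
The look-around form can be compared against `\b` in a short sketch. It assumes look-behind support, which ECMAScript only gained later (ES2018); the sample strings and names are made up for illustration.

```javascript
// Look-around spelling of "word" as a whole word (needs look-behind, ES2018+).
const withLookaround = /(?<!\w)word(?!\w)/;

// The usual \b spelling.
const withBoundary = /\bword\b/;

const samples = ['a word here', 'swordfish', 'word', 'words', 'pass-word!'];

// For a literal that starts and ends with \w characters, the two agree.
const agreement = samples.map(
  s => withLookaround.test(s) === withBoundary.test(s)
);

console.log(agreement);
```

The equivalence holds here because the literal begins and ends with word characters; for other patterns the full alternation form is needed.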

 Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly
formed response will also be forgiven!

-- 
Erik Corry

Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Norbert Lindenberg


On Mar 17, 2012, at 10:20 , Erik Corry wrote:

 2012/3/17 Steven L. steves_l...@hotmail.com:
 Erik Corry wrote:
 
 However I think we probably do want the /u modifier on regexps to
 control the new backward-incompatible behaviour.  There may be some
 way to relax this for regexp literals in opted in Harmony code, but
 for new RegExp(...) and for other string literals I think there are
 rather too many inconsistencies with the old behaviour.
 
 
 Disagree with adding /u for this purpose and disagree with breaking backward
 compatibility to let `/./.exec(s)[0].length == 2`.
 
 Care to enlighten us with any thinking behind this disagreeing?
 
 Instead, if this is
 deemed an important enough issue, there are two ways to match any Unicode
 grapheme that match existing regex library precedent:
 
 From Perl and PCRE:
 
 \X
 
 This doesn't work inside [].  Were you envisioning the same restriction in JS?
 
 Also it matches a grapheme cluster, which may be useful but is
 completely different to what the dot does.
 
 From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):
 
 \P{M}\p{M}*
 
 Obviously \X is prettier, but because it's fairly rare for people to care
 about this, IMO the more widely compatible solution that uses Unicode
 categories is Good Enough if Unicode category syntax is on the table for
 ES6.
 
 Norbert Lindenberg wrote:
 
 \u[\u-\u] is interpreted as [\u\u-\u\u]
 
 Norbert, this just happens automatically if unmatched surrogates are
 just treated as if they were normal code units.

I don't see how. In the actual matching process, the new design only looks at 
code points, not code units. Without this transformation, it would see 
surrogate code points in the pattern, but supplementary code points in the text 
to be matched. Enhancing the matching process to recognize surrogate code 
points and insert them into the continuation might work, but wouldn't be any 
prettier than this transformation.
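
A sketch of why the transformation matters, using the ES2015 `u` flag as later standardized and the Gothic range U+10330 to U+1034A as an example of my own choosing (the escapes in the quoted patterns above are not recoverable, so this is not the original example). The helper `toCodePoint` is hypothetical.

```javascript
// Standard UTF-16 arithmetic: combine a lead and trail surrogate.
function toCodePoint(lead, trail) {
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

const gothicAhsa = '\uD800\uDF30'; // U+10330 as a surrogate pair

// Code unit mode: a lead surrogate followed by a class of trail surrogates.
const codeUnitPattern = /^\uD800[\uDF30-\uDF4A]$/;

// Code point mode: the same set written as one class of code points.
const codePointPattern = /^[\u{10330}-\u{1034A}]$/u;

// The code unit spelling, untransformed, under the u flag: the lone \uD800
// in the pattern cannot match half of a pair in the input.
const untransformedPattern = /^\uD800[\uDF30-\uDF4A]$/u;

console.log(
  toCodePoint(0xD800, 0xDF30).toString(16),
  codeUnitPattern.test(gothicAhsa),
  codePointPattern.test(gothicAhsa),
  untransformedPattern.test(gothicAhsa)
);
```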

 [\u-\u][\u-\u] is interpreted as
 [\u\u-\u\u]
 
 Norbert, this will have different semantics to the current
 implementations unless the second range is the full trail surrogate
 range.

True. I think if we restrict the transformation to that specific case it'll 
still cover normal usage of this pattern.

 I agree with Steven that these two cases should just be left alone,
 which means they will continue to work the way they have until now.
 
 Some people will want a way to match arbitrary Unicode code
 points rather than graphemes anyway, so leaving \u alone lets that use
 case continue working. This would still allow modifying the handling of
 literal astral/supplementary characters in RegExps. If it can be handled
 sensibly, I'm all for treating literal characters in RegExps as discrete
 graphemes rather than splitting them into surrogate pairs.
 
 You seem to be confusing graphemes and unicode code points.  Here is
 the same text 3 times:
 
 Four UTF-16 code units:
 
 0x0020 0xD800 0xDF30 0x0308
 
 Three Unicode code points:
 
 0x20 0x10330 0x308
 
 Two Graphemes
 
   ¨  -- This is an attempt to show a Gothic Ahsa with an umlaut.
 My mail program probably screwed it up.

Mac Mail is usually Unicode-friendly, so let's try again:
 𐌰̈

 The proposal you are responding to is all about adding Unicode code
 point handling to regexps.  It is not about adding grapheme support,
 which is a rather different issue.

Correct - thanks for the explanation!


Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Norbert Lindenberg

On Mar 17, 2012, at 11:58 , Erik Corry wrote:

2012/3/17 Steven L. steves_l...@hotmail.com:
I further objected because I think the /u flag would be better used as an
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
Python's re.UNICODE or (?u) flag, which does the same thing except that it
also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think any digit
including rods and roman characters but not decimal points/commas
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is poor use of name space. The \d escape
in perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode
numerals:
http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes
This suggests to me that it's not very useful.

Looking at that page, it seems \d gives you a reasonable set of digits, the
ones in the Unicode general category Nd (number, decimal). These digits come
from a variety of writing systems, but all are used in decimal-positional
notation, so you can parse at least integers using them with a fairly generic
algorithm.

Dealing with roman numerals or counting rods requires specialized algorithms,
so you probably don't want to find them in this bucket.
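
A minimal sketch of such a generic algorithm follows. The table `ZERO_DIGITS` and both function names are hypothetical, and the table lists only four Nd ranges for illustration; a real implementation would need the zero digit of every Nd range from the Unicode data.

```javascript
// Hypothetical, incomplete table: code point of the zero digit for a few
// Nd ranges (ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari).
const ZERO_DIGITS = [0x0030, 0x0660, 0x06F0, 0x0966];

// Return the numeric value 0..9 of a decimal digit, or -1 if not in the table.
function digitValue(ch) {
  const cp = ch.codePointAt(0);
  for (const zero of ZERO_DIGITS) {
    if (cp >= zero && cp <= zero + 9) return cp - zero;
  }
  return -1;
}

// Parse a non-negative integer written with any of the listed digit sets.
function parseDecimalInteger(text) {
  if (text.length === 0) return NaN;
  let value = 0;
  for (const ch of text) {          // iterates by code point
    const d = digitValue(ch);
    if (d < 0) return NaN;          // e.g. Roman numerals are rejected
    value = value * 10 + d;
  }
  return value;
}

console.log(parseDecimalInteger('42'),
            parseDecimalInteger('\u0664\u0662'),  // Arabic-Indic 4, 2
            parseDecimalInteger('\u096A\u0968')); // Devanagari 4, 2
```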

Norbert


Re: Full Unicode based on UTF-16 proposal

2012-03-17 Thread Steven L.


Erik Corry wrote:


I further objected because I think the /u flag would be better used as an
ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on
Python's re.UNICODE or (?u) flag, which does the same thing except that it
also covers \s (which is already Unicode-based in ES).


I am rather skeptical about treating \d like this.  I think any digit
including rods and roman characters but not decimal points/commas
http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals
would be needed much less often than the digits 0-9, so I think
hijacking \d for this case is poor use of name space.  The \d escape
in perl does not cover other Unicode numerals, and even with the
[:name:] syntax there appears to be no way to get the Unicode
numerals: 
http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes

 This suggests to me that it's not very useful.


I know from experience that it's common for Arabic speakers to want to match 
both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari 
digits, and probably others. Even if it wasn't often useful, IMO this change 
is necessary for congruity with Unicode-enabled \w and \b (I'll get to 
that), and would likely never be detrimental since /u would be opt-in and 
it's easy to explicitly use [0-9] when that's what you want.


For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not 
/\p{N}/. I.e., it should not match any Unicode number, but rather any 
Unicode decimal digit (see 
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the 
list). And as Norbert noted, that is in fact what Perl's \d matches.
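
The distinction between the two categories can be sketched with the property escapes that ECMAScript later gained (ES2018, `u` flag). This is not the proposed `/\d/u` itself, and the names below are illustrative.

```javascript
// Decimal digits only versus any Unicode number (ES2018+ property escapes).
const decimalDigit = /^\p{Nd}$/u;
const anyNumber = /^\p{N}$/u;

// Report category membership for one character.
function numberCategories(ch) {
  return { Nd: decimalDigit.test(ch), N: anyNumber.test(ch) };
}

const ascii7 = numberCategories('7');
const arabicIndic3 = numberCategories('\u0663'); // ARABIC-INDIC DIGIT THREE (Nd)
const romanFour = numberCategories('\u2163');    // ROMAN NUMERAL FOUR (Nl)
const oneHalf = numberCategories('\u00BD');      // VULGAR FRACTION ONE HALF (No)

console.log(ascii7, arabicIndic3, romanFour, oneHalf);
```

Roman numerals and fractions fall under \p{N} but not \p{Nd}, so they would stay outside the proposed \d.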


Comparison with other regex flavors:

* \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default).
* \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)).

* \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default).
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)).

* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default).
* \d == \p{Nd} -- .NET, Perl, Python (with (?u)).

* \s == [\x09-\x0D\x20] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Note that Java's \w and \b are inconsistent.

Unicode-based \w and \b are incredibly useful, and it is very common for 
users to sometimes want them to be Unicode-based--thus, an opt-in flag 
offers the best of both worlds. In fact, I'd go so far as to say they are 
broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which 
currently returns true.
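
The `naïve` case can be reproduced, and a Unicode-aware check approximated, with a sketch like the one below. It assumes ES2018 property escapes and spells out the word set [\p{L}\p{Nd}_] by hand; it is not how a Unicode-enabled \b would be written under the proposal, and the names are illustrative.

```javascript
const word = 'na\u00EFve'; // "naïve" with precomposed U+00EF

// ASCII \b: "ï" is not in [A-Za-z0-9_], so a boundary is seen after "a".
const asciiBoundaryAfterA = /a\b/.test(word);

// Approximate Unicode-aware boundary after "a": no word character follows,
// where a word character is [\p{L}\p{Nd}_].
const unicodeBoundaryAfterA = /a(?![\p{L}\p{Nd}_])/u.test(word);

// Sanity check: a real boundary is still detected.
const unicodeBoundaryBeforeSpace = /a(?![\p{L}\p{Nd}_])/u.test('na ve');

console.log(asciiBoundaryAfterA, unicodeBoundaryAfterA, unicodeBoundaryBeforeSpace);
```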


Unicode-based \d would not only help international users/apps, it is also 
important because otherwise Unicode-based \w and \b would have to use 
[\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, 
Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used 
[\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including 
user confusion), [^\W\d_] could not be used equivalently to \p{L}.



And instead of changing the meaning of \w, which will be confusing, I
think that [:alnum:] as in perl would work fine.


[:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only 
[A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works 
only within character classes. IMO, the POSIX-style [[:name:]] syntax is 
clumsy and confusing, not to mention backward incompatible. It would 
potentially also be confusing if ES supports only [:alnum:] without adding 
the rest of the (not-very-useful) POSIX regex class names.



\b is a little tougher.  The Unicode rewrite would be
(?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:])) which is
obviously too verbose.  But if we take \b for this then the ASCII
version has to be written as
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) which is also more than a little
annoying.  However, often you don't need that if you have negative
lookbehind because you can write something
like

/(?<!\w)word(?!\w)/  // Negative look-behind for a \w at the start and
                     // negative look-ahead for a \w at the end.

which isn't _too_ bad, even if it is much worse than

/\bword\b/


I've already started to explain above why I think Unicode-based \b is 
important and useful. I'll just add the footnote that relying on lookbehind 
would in all likelihood perform less efficiently than \b (depending on 
implementation optimizations).



Indeed. My response was rushed and poorly formed. My apologies.


Gratefully accepted with the hope that my next rushed and poorly
formed response will also be forgiven!


Consider it done. ;-P

--Steven Levithan



Re: Full Unicode based on UTF-16 proposal