Re: Full Unicode based on UTF-16 proposal
On Mon, Mar 26, 2012 at 10:37 PM, Norbert Lindenberg ecmascr...@norbertlindenberg.com wrote:

> The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted.

True, but if the proposed language

    A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.

is adopted, then will not this have an effect of creating unpaired surrogates as code points? If so, then by my estimation, this *will* increase the likelihood of their being interpreted as abstract characters... e.g., if the unpaired code unit is interpreted as an unpaired surrogate code point, and some process/function performs *any* predicate or transform on that code point, then that amounts to interpreting it as an abstract character.

I would rather see such an unpaired code unit either (1) be mapped to U+FFFD, or (2) cause an exception to be raised when performing an operation that requires conversion of the UTF-16 code unit sequence.

> My proposal interprets the resulting code points in the following ways:
>
> 1) In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category.
>
> 2) When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD.
>
> 3) In case conversion, they won't have upper case or lower case equivalents defined, and remain as is, as would happen for unassigned code points or U+FFFD.
>
> I don't think any of these amounts to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything.
>
> Norbert
>
> On Mar 26, 2012, at 15:10 , Glenn Adams wrote:
>
>> On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough barraclo...@apple.com wrote:
>>
>>> I really like the direction you're going in, but have one minor concern relating to regular expressions. In your proposal, you currently state:
>>>
>>>     A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.
>>
>> Just as a reminder, this would be in explicit violation of the Unicode conformance clause C1 unless it can be guaranteed that such a code point will not be interpreted as an abstract character:
>>
>>     C1: A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. [1]
>>
>> [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
>>
>> Given that such a guarantee is likely impractical, this presents a problem for the above proposed language.

___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
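The interpretation rule debated in this message can be tried out with the code point APIs that were later standardised in ECMAScript 2015. What follows is only a sketch, assuming an engine that supports `String.prototype.codePointAt` and `for...of` iteration over strings; the helper name `codePoints` is made up for illustration and is not part of any proposal.

```javascript
// Hypothetical helper: list the code points of a string as hex strings.
// String iteration (for...of) walks code points, not UTF-16 code units.
function codePoints(str) {
  const out = [];
  for (const ch of str) {
    out.push(ch.codePointAt(0).toString(16).toUpperCase());
  }
  return out;
}

// A well-formed surrogate pair is read as one supplementary code point.
const paired = codePoints('a\uD800\uDC00b');   // ['61', '10000', '62']

// An unpaired surrogate is read as a code point with the same value.
const unpaired = codePoints('a\uDC00b');       // ['61', 'DC00', '62']

// Case conversion leaves the unpaired surrogate as is.
const upper = 'a\uDC00b'.toUpperCase();        // 'A\uDC00B'

console.log(paired, unpaired, upper === 'A\uDC00B');
```

Nothing here replaces the unpaired surrogate with U+FFFD or throws; those are the two alternatives Glenn asks for, and they would have to be layered on top by the caller.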
Re: Full Unicode based on UTF-16 proposal
Erik Corry wrote:

> Steven Levithan wrote:
>
>> - Make \d\w\b Unicode-aware.
>
> I think we should leave these alone. They are concise and useful and will continue to be so when /u is the default in Harmony code. Instead we should introduce \p{...} immediately which provides the same functionality.

\w and \b are broken without Unicode. ASCII \d is concise and useful, but so is [0-9]. Unicode-aware \b can't be emulated using \p{..} unless lookbehind is also added (which is tentatively approved for ES6 but could get delayed). Unicode-aware \w\b\d are required by UTS#18. If \w\b\d are not made Unicode-aware by /u, we won't easily be able to fix them in the future.

We went down this road before, and at the end you agreed that \w\b\d with /u should be Unicode aware. :/

I agree with adding \p{..} as soon as possible, with two caveats:

* If I recall correctly, mobile browser implementers voiced concerns about overhead during the es4-discuss days.
* It can easily be pushed down the road to ES7+. Delaying /u, on the other hand, might mean also having to delay Norbert's work on code point matching, etc. Introducing \p{..} without code point matching would be nonideal.

\p{..} might *need* to be delayed anyway to allow RegExp proposals already approved by TC39 (match web reality, lookbehind, flag /y), the flag /x strawman, and flag /u to be completed in time. For starters, it's not clear which properties \p{..} in ES would support, and there would be a number of other details to discuss, too.

Erik Corry wrote:

> Make unpaired surrogates in /u regexps a syntax error.

Sounds good to me.

-- Steven Levithan
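The point about \b needing lookbehind can be illustrated with features that shipped years after this thread. The sketch below assumes an engine with ES2018 property escapes and lookbehind; in the `u` flag as eventually standardised, \w and \b remained ASCII-based, so a Unicode-aware boundary has to be spelled out by hand. The class `[\p{L}\p{N}_]` is just one plausible stand-in for a "Unicode \w" and is not the UTS#18 definition.

```javascript
const word = 'na\u00EFve';                       // "naïve"

// \w stays ASCII-only even with the u flag.
const asciiWord = /^\w+$/u.test(word);           // false

// A property-based class accepts the non-ASCII letter.
const unicodeWord = /^[\p{L}\p{N}_]+$/u.test(word);   // true

// Emulate a Unicode-aware \b: a position with a word character on exactly one side.
const W = '[\\p{L}\\p{N}_]';
const boundary = new RegExp(`(?<=${W})(?!${W})|(?<!${W})(?=${W})`, 'gu');

const text = '\u00E9t\u00E9 l\u00E0';            // "été là"
const unicodeMarks = text.replace(boundary, '|'); // '|été| |là|'
const asciiMarks = text.replace(/\b/g, '|');      // 'é|t|é |l|à'

console.log(asciiWord, unicodeWord, unicodeMarks, asciiMarks);
```

The ASCII \b places boundaries inside both words because it treats the accented letters as non-word characters, which is the breakage Steven describes.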
Re: Full Unicode based on UTF-16 proposal
This begs the question of what is the point of C1.

On Tue, Mar 27, 2012 at 9:13 AM, Mark Davis ☕ m...@macchiato.com wrote:

> That would not be practical, nor predictable. And note that the 700K reserved code points are also not to be interpreted as characters; by your logic all of them would need to be converted to FFFD.
>
> And in practice, an unpaired surrogate is best treated just like a reserved (unassigned) code point. For example, a lowercase operation should convert characters with lowercase correspondents to those correspondents, and leave *everything* else alone: control characters, format characters, reserved code points, surrogates, etc.
>
> -- Mark
> https://plus.google.com/114199149796022210033
> Il meglio è l’inimico del bene
>
> On Tue, Mar 27, 2012 at 08:02, Glenn Adams gl...@skynav.com wrote:
>
>> On Tue, Mar 27, 2012 at 8:39 AM, Mark Davis ☕ m...@macchiato.com wrote:
>>
>>> That, as Norbert explained, is not the intention of the standard. Take a look at the discussion of Unicode 16-bit string in chapter 3. The committee recognized that fragments may be formed when working with UTF-16, and that destructive changes may do more harm than good.
>>>
>>>     x = a.substring(0, 5) + b + a.substring(5, a.length());
>>>     y = x.substring(0, 5) + x.substring(6, x.length());
>>>
>>> After this operation is done, you want y == a, even if 5 is between D800 and DC00.
>>
>> Assuming that b.length() == 1 in this example, my interpretation of this is that '=', '+', and 'substring' are operations whose domain and co-domain are (currently defined) ES Strings, namely sequences of UTF-16 code units. Since none of these operations entail interpreting the semantics of a code point (i.e., interpreting abstract characters), then there is no violation of C1 here.
>>
>>> Or take:
>>>
>>>     output = "";
>>>     for (int i = 0; i < s.length(); ++i) {
>>>         ch = s.charAt(i);
>>>         if (ch.equals('&')) {
>>>             ch = '@';
>>>         }
>>>         output += ch;
>>>     }
>>>
>>> After this operation is done, you want a&\u{10000}b to become a@\u{10000}b, not a\u{FFFD}\u{FFFD}b. It is also an unnecessary burden on lower-level software to always check this stuff.
>>
>> Again, in this example, I assume that the string literal a&\u{10000}b maps to the UTF-16 code unit sequence:
>>
>>     0061 0026 D800 DC00 0062
>>
>> Given that 'charAt(i)' is defined on (and is indexing) code units and not code points, and since the 'equals' operator is also defined on code units, this example also does not require interpreting the semantics of code points (i.e., interpreting abstract characters).
>>
>> However, in Norbert's questions above about isUpperCase(int) and toUpperCase(int), it is clear that the domain of these operations is code points, not code units, and further, that they require interpretation as abstract characters in order to determine the semantics of the corresponding characters.
>>
>> My conclusion is that the determination of whether C1 is violated or not depends upon the domain, codomain, and operation being considered.
>>
>>> Of course, when you convert to UTF-16 (or UTF-8 or 32) for storage or output, then you do need to either convert to FFFD or take some other action.
>>>
>>> -- Mark
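The two snippets quoted in this message are written in Java-like pseudocode. As a rough JavaScript rendering (a sketch only; the sample strings and variable names are chosen here for illustration), the following shows why code-unit operations are expected to round-trip even when they temporarily split a surrogate pair.

```javascript
// U+10000 is stored as the surrogate pair D800 DC00 at indices 4 and 5.
const a = 'abcd\u{10000}ef';
const b = 'X';                       // b.length === 1

// Insert b between the two halves of the pair, then remove it again.
const x = a.substring(0, 5) + b + a.substring(5, a.length);
const y = x.substring(0, 5) + x.substring(6, x.length);

// x temporarily holds two unpaired surrogates; y is identical to a again.
console.log(y === a);                // true

// Code-unit loop from the second snippet: replace '&' with '@'.
const s = 'a&\u{10000}b';
let output = '';
for (let i = 0; i < s.length; ++i) {
  let ch = s.charAt(i);              // one UTF-16 code unit per iteration
  if (ch === '&') {
    ch = '@';
  }
  output += ch;
}
console.log(output === 'a@\u{10000}b');   // true
```

If string concatenation or `charAt` replaced unpaired surrogates with U+FFFD, both comparisons would fail, which is the "destructive change" Mark warns about.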
Re: Full Unicode based on UTF-16 proposal
The point of C1 is that you can't interpret the surrogate code point U+DC00 as a *character*, like an "a". Neither can you interpret the reserved code point U+0378 as a *character*, like a "b".

-- Mark
Re: Full Unicode based on UTF-16 proposal
On Mar 26, 2012, at 11:57 PM, Erik Corry wrote:

> Add /U to mean old-style regexp literals in Harmony code (analogous to /s and /S which have opposite meanings).

Are we sure this has enough utility to be worth adding? It seems unlikely that programmers will often have cause to explicitly opt out of correct Unicode support (since little consideration usually seems to be given to this topic), and as discussed previously, a mechanism to do so already exists if they need it (RegExp("foo") will behave the same as the proposed /foo/U).

If we do add a 'U' flag, I'd worry that it may end up more commonly being used in error when people intended to append a 'u'!

cheers,
G.
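For comparison with the opt-out mechanism discussed here, this is how the opt-in `u` flag behaves as it was eventually standardised in ES2015, where no regexp gets code point semantics by default and no `U` flag exists. The sketch assumes an ES2015-capable engine; the test string is arbitrary.

```javascript
const smiley = '\u{1F600}';   // one code point, two UTF-16 code units

// Code unit mode: "." matches a single code unit, so the pair does not fit ^.$
const literalNoFlag = /^.$/.test(smiley);                   // false
const constructorNoFlag = new RegExp('^.$').test(smiley);   // false, same semantics

// Code point mode requires opting in with the u flag, in either form.
const literalU = /^.$/u.test(smiley);                       // true
const constructorU = new RegExp('^.$', 'u').test(smiley);   // true

console.log(literalNoFlag, constructorNoFlag, literalU, constructorU);
```

Literal and constructor forms agree when given the same flags, so under opt-in semantics there is nothing for a separate opt-out flag to do.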
Re: Full Unicode based on UTF-16 proposal
So, if as a result of a policy of converting any UTF-16 code unit sequence to a code point sequence one ends up with an unpaired surrogate, e.g., \u{00DC00}, then performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?

I can see that D20 defines code point properties which would not entail interpreting as an abstract character, e.g., IsSurrogate, IsNonCharacter, but where does one draw the line?
Re: Full Unicode based on UTF-16 proposal
ok, i'll accept your position at this point and drop my comment; i suppose it is true that if there are already unpaired surrogates in user data as UTF-16, then having unpaired surrogates as code points is no worse;

however, it would be useful if there were an informative pointer from the spec under consideration to a UTC sanctioned list of operations that constitute interpreting as abstract characters and that, if used on such data, would possibly violate C1; to this end, it would be useful if C1 itself included a concrete example of such an operation

On Tue, Mar 27, 2012 at 2:02 PM, Mark Davis ☕ m...@macchiato.com wrote:

>> performing a predicate on that code point, such as described in D21 (e.g., IsAlphabetic) would entail interpreting it as an abstract character?
>
> No.
>
>> but where does one draw the line?
>
> The line is already drawn by the Unicode Consortium, by consulting the Unicode Character Database properties. If you look at the data in the Unicode Character Database for any particular property, say Alphabetic, you'll find that surrogate code points are not included where the property is a true character property.
>
> There are a few special cases where reserved code points are provisionally given anticipatory character properties, such as in bidi ranges, simply because that makes implementations more forward compatible, but there aren't any cases where a character property applies to a surrogate code point (other than by returning No, or n/a, or some such).
>
> -- Mark
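Mark's answer, that property lookups on a surrogate code point simply come back negative, can be checked against an engine's copy of the Unicode Character Database. This is a sketch that assumes ES2018 property escapes (`\p{...}` with the `u` flag), which did not exist when the thread was written; the variable names are illustrative.

```javascript
const lone = '\uDC00';       // unpaired low surrogate
const reserved = '\u0378';   // the reserved code point used as an example in this thread

// True character properties answer "no" for a surrogate code point.
const loneAlphabetic = /^\p{Alphabetic}$/u.test(lone);                   // false
const loneLowercase = /^\p{Lowercase}$/u.test(lone);                     // false

// A code point property such as the general category still identifies it.
const loneIsSurrogate = /^\p{General_Category=Surrogate}$/u.test(lone);  // true

// Reserved code points behave the same way for character properties
// (this holds only for as long as U+0378 stays unassigned).
const reservedAlphabetic = /^\p{Alphabetic}$/u.test(reserved);

console.log(loneAlphabetic, loneLowercase, loneIsSurrogate, reservedAlphabetic);
```

Asking whether a code point is a surrogate is a D20-style code point property, while Alphabetic is a character property that is defined to be false here, so neither test treats U+DC00 as an abstract character.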
Re: Full Unicode based on UTF-16 proposal
Perfectly valid concerns. My thinking here is that normally applications want to deal with code points, but we force them to deal with UTF-16 and additional flags because we need them for compatibility. Within modules, where we know that compatibility is not an issue, I'd rather give applications by default what they need.

Looking back at Java, supporting supplementary characters was fairly painless for many applications despite UTF-16 because Java already had a rich API performing all kinds of operations on strings, so many applications had little need to look at individual characters in the first place. We went through the entire Java SE API and fixed all those operations to use code point semantics (look for "under the hood" at [1] for details). We were also able to switch regular expressions to code point semantics without any flags because regular expressions never worked on binary data and developers hadn't created funky workarounds to support supplementary characters yet.

JavaScript today has more constraints, but for new development it would still be good to get as close as possible to that experience.

Norbert

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

On Mar 24, 2012, at 23:56 , David Herman wrote:

> On Mar 24, 2012, at 4:32 PM, Norbert Lindenberg wrote:
>
>> One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony).
>
> This idea makes me nervous. Partly because I think we should keep the set of semantic changes between non-module code and module code reasonably small, and partly because the idea of your proposal is to continue to treat strings as sequences of 16-bit code units, not Unicode code points, which means that quietly switching regexps to be closer to operating at the level of code points seems like it creates a kind of impedance mismatch.
>
> It feels more appropriate to me to require programmers to declare explicitly that they're dealing with a string at the level of code points, using the (quite concise) /u flag. That way they're saying "yes, I know this string is just a sequence of 16-bit code units, but it may contain non-BMP data, and I would like to match its contents with a regexp that deals with code points."
>
> (Again, I'm still new to the finer points of Unicode, so I'm prepared to be shown I'm thinking about it wrong.)
>
> Dave
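The "impedance mismatch" Dave mentions is visible in the semantics that were eventually standardised: the `u` flag makes the pattern work on code points while the string, and every index reported back, stays in code units. A minimal sketch, assuming an ES2015 engine and an arbitrary sample string:

```javascript
const str = 'a\u{1D306}b';             // U+1D306 lies outside the BMP

// The string itself is still a sequence of 16-bit code units.
const units = str.length;               // 4
const points = Array.from(str).length;  // 3 (string iteration is by code point)

// With /u the pattern "." consumes the whole code point...
const m = /a(.)b/u.exec(str);
const capturedLength = m[1].length;     // 2 code units, 1 code point

// ...but positions reported back are still code unit indices.
const indexOfB = str.search(/b/u);      // 3, not 2

console.log(units, points, capturedLength, indexOfB);
```

Code that mixes regexp results with `substring` or `charAt` therefore keeps working in code units, whichever matching mode was requested.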
Re: Full Unicode based on UTF-16 proposal
Let's see:

- Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.
- Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].
- HTML form data: Same situation as conversion to UTF-8.
- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.

I don't think we'd add API just to flag an issue - that's what documentation is for.

Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit

On Mar 25, 2012, at 1:57 , Roger Andrews wrote:

> I use something like String.isValid functionality in a transcoder that converts Strings to/from UTF-8, HTML Formdata (MIME type application/x-www-form-urlencoded -- not the same as URI encoding!), and Base64.
>
> Admittedly these currently use 'encodeURI' to do the work, or it just drops out naturally when considering UTF-8 sequences.
>
> (I considered testing the regexp
>     /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/
> against the input string.)
>
> Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
>
> --
> From: Norbert Lindenberg ecmascr...@norbertlindenberg.com
>
>> It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity:
>> - what are valid ECMAScript identifiers?
>> - what are valid BCP 47 language tags?
>> - what are the characters allowed in a certain protocol?
>> - what are the characters that my browser can render?
>>
>> Thanks,
>> Norbert
>>
>> On Mar 24, 2012, at 12:12 , David Herman wrote:
>>
>>> On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:
>>>
>>>> Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray().
>>>
>>> No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics.
>>>
>>> Dave
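The well-formedness check and the "convert rather than refuse" behaviour discussed in this message can be sketched as follows. Both function names are hypothetical, the regular expression is the one Roger describes, and the conversion assumes a runtime that provides the standard `TextEncoder` global (browsers and recent Node.js versions).

```javascript
// Hypothetical helper in the spirit of the proposed String.isValid:
// true when every surrogate code unit is part of a high+low pair.
function isValidUtf16(str) {
  return /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/.test(str);
}

// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
// TextEncoder does not refuse ill-formed input; each unpaired surrogate
// is encoded as U+FFFD (bytes EF BF BD).
function toUtf8Hex(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes, (byte) => byte.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(isValidUtf16('a\uD800\uDC00b'));   // true
console.log(isValidUtf16('a\uD800b'));         // false
console.log(toUtf8Hex('a\uD800\uDC00b'));      // 61 F0 90 80 80 62
console.log(toUtf8Hex('a\uD800b'));            // 61 EF BF BD 62
```

This matches Norbert's observation: the encoder has to inspect every code unit anyway, so a separate validity check mainly helps callers that want to reject bad input instead of converting it.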
Re: Full Unicode based on UTF-16 proposal
There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just use the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as regular expressions where you might have to refer to range limits that aren't actually assigned characters, or in test cases where you might use characters for which your OS doesn't have glyphs yet.

Norbert

On Mar 25, 2012, at 2:57 , Roger Andrews wrote:

> Doesn't C/C++ allow non-BMP code points using \U in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes. Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever.
>
> Could you use this to avoid complicated things in RegExps like [{\uXXXX\uXXXX}-{\uXXXX\uXXXX}], and instead have things like [\U00010000-\U0003FFFF], naturally expressing the characters of interest?
>
> The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character.
>
> (Sorry if I've missed something in the prior discussion.)
>
> --
> From: Norbert Lindenberg
> To: David Herman
>
>> On Mar 24, 2012, at 12:21 , David Herman wrote:
>>
>>> [snip]
>>>
>>>> As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uXXXX[\uXXXX-\uXXXX] to [{\uXXXX\uXXXX}-{\uXXXX\uXXXX}] and [\uXXXX-\uXXXX][\uDC00-\uDFFF] to [{\uXXXX\uDC00}-{\uXXXX\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):
>>>
>>> I haven't completely understood this part of the discussion. Looking at /u as a little red switch (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:
>>>
>>>     js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
>>>     ["𝌆𝌇𝌈𝌉𝌊"]
>>>
>>> I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?
>>
>> With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
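The character-class example in this message can be reproduced with the `u` flag as it later shipped in ES2015. In this sketch the class is written with `\u{...}` escapes, on the assumption that the characters in Dave's example are the Tai Xuan Jing tetragrams U+1D306 through U+1D30A with U+1D356 as the upper range limit; the constructor call is wrapped in `try` because the same class is expected to be rejected in code unit mode.

```javascript
// Five tetragram symbols, U+1D306..U+1D30A, each stored as a surrogate pair.
const input = '\u{1D306}\u{1D307}\u{1D308}\u{1D309}\u{1D30A}';

// With /u, the class range is read as the code points U+1D306..U+1D356.
const withU = input.match(/[\u{1D306}-\u{1D356}]+/u);
const matchedAll = withU !== null && withU[0] === input;    // true

// Without /u, the same class written with raw characters is parsed as code
// units: D834, then the range DF06-D834, then DF56. That range is out of order.
let errorName = null;
try {
  new RegExp('[\uD834\uDF06-\uD834\uDF56]+');
} catch (e) {
  errorName = e.name;                                        // 'SyntaxError'
}

console.log(matchedAll, errorName);
```

This is also why an explicit flag avoids the pattern transformations mentioned above: code unit mode keeps rejecting or misreading such classes exactly as before, and only patterns that opt in are read by code point.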
Re: Full Unicode based on UTF-16 proposal
The strawman is for source code characters, and says it has no implications for string value encodings (or RegExps). String and regexp literal escape sequences are explicitly defined in ES5 sections 7.8.4 and 7.8.5. Will Strawman style also work in ES6 string and regexp literals? That would make regexp ranges much nicer (see final example below).

As well as describing code points that have not yet been defined as characters, character escapes in string literals and regexps are useful because:
1) control characters don't have glyphs at all,
2) the various space glyphs are not readily distinguishable (same for some dash/minus/line glyphs),
3) breaking/non-breaking versions of characters are not distinguishable,
4) many other glyphs are hard to distinguish (being tiny adjustments in positioning or form detail),
5) some characters are combining -- which makes for a messy and confusing program if you use them raw.

If you use the raw non-ASCII characters in a program then you need some means of creating them, preferably via a normal keyboard and in your favourite text editor. All program readers need appropriate fonts installed to fully understand the program, and program maintainers also need a Unicode-capable text editor (potentially including non-BMP support). All links/stores that the program travels over or rests in must be Unicode-capable. Whereas using only ASCII chars to write a program is easy to do and always works no matter how basic your computing/transmission infrastructure. (ASCII chars never get silently mangled in transmission or text editors.)

How to represent character escapes in a language.

C/C++ has:
  \xNN        8-bit char  (U+0000 - U+00FF)
  \uNNNN      16-bit char (U+0000 - U+FFFF)
  \UNNNNNNNN  32-bit char (i.e. any 21-bit Unicode char)

Strawman for source chars has:
  \u{N...}    8 to 24-bit char (i.e. any 21-bit Unicode char)

I'm struggling with how non-BMP escapes would be used in practice in strings and regexps -- especially regexp ranges. Will Strawman style be used in string and regexp literals?
Considering U+1D307 (𝌇) as an example (where 𝌇 == \uD834\uDF07).

To create the string "I like 𝌇" using escapes in C/C++ you can create a string:
  "I like \U0001D307"
if the Strawman style works in strings, in ES6 presumably you say:
  "I like \u{1D307}"
or do you have to know UTF-16 encoding rules and say:
  "I like \uD834\uDF07"

To use U+1D307 (𝌇) and U+1D356 (𝍖) as a range in a regexp, i.e. /[𝌇-𝍖]/, should the programmer write:
C/C++ style
  /[\U0001D307-\U0001D356]/
or will Strawman style work in regexps
  /[\u{1D307}-\u{1D356}]/
or in UTF-16 with {} grouping
  /[{\uD834\uDF07}-{\uD834\uDF56}]/

Either C/C++ style or Strawman style escape is readable, natural, doesn't require knowledge of UTF-16 encoding rules, can be created easily with any old keyboard, and won't upset text editors. It's a bit unfriendly to require programmers to know UTF-16 rules just to put a non-BMP character in a string or regexp using an escape. And in a regexp range it looks ugly and confusing.

-- From: Norbert Lindenberg

There is a strawman for code point escapes:
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences

Note that for references to specific characters it's usually best to just use the characters directly, as Dave did in "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u). Escapes can be useful in cases such as regular expressions where you might have to refer to range limits that aren't actually assigned characters, or in test cases where you might use characters for which your OS doesn't have glyphs yet.

Norbert

On Mar 25, 2012, at 2:57 , Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \U in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes. Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever.
Could you use this to avoid complicated things in RegExps like [{\u\u}-{\u\u}], instead have things like [\U0001-\U0003] -- naturally expressing the characters of interest? The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character. (Sorry if I've missed something in the prior discussion.) -- From: Norbert Lindenberg To: David Herman On Mar 24, 2012, at 12:21 , David Herman wrote: [snip] As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally
Re: Full Unicode based on UTF-16 proposal
Maybe String.isValid is just not generally useful enough. I accept the point that you don't add APIs simply to flag an issue (there has to be more weighty justification to carry the trifle).

PS: As for UTF-16 -> UTF-8 or HTML-Formdata, I decided to follow encodeURI / encodeURIComponent's lead and throw an exception. Maybe that's the wrong thing to do? My UTF-8 -> UTF-16 does check for well-formed UTF-8 because it seemed the right thing to do. Thanks for the link which explains why. Base64 encodes 8-bit octets, so UTF-16 first gets converted to UTF-8, same issues as above really.

-- From: Norbert Lindenberg

Let's see:
- Conversion to UTF-8: If the string isn't well-formed, you wouldn't refuse to convert it, so isValid doesn't really help. You still have to look at all code units, and convert unpaired surrogates to the UTF-8 sequence for U+FFFD.
- Conversion from UTF-8: For security reasons, you have to check for well-formedness before conversion, in particular to catch non-shortest forms [1].
- HTML form data: Same situation as conversion to UTF-8.
- Base64 encodes binary data, so UTF-16 well-formedness rules don't apply.

I don't think we'd add API just to flag an issue - that's what documentation is for.

Norbert

[1] http://www.unicode.org/reports/tr36/#UTF-8_Exploit

On Mar 25, 2012, at 1:57 , Roger Andrews wrote:

I use something like String.isValid functionality in a transcoder that converts Strings to/from UTF-8, HTML Formdata (MIME type application/x-www-form-urlencoded -- not the same as URI encoding!), and Base64. Admittedly these currently use 'encodeURI' to do the work, or it just drops out naturally when considering UTF-8 sequences. (I considered testing the regexp /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/ against the input string.) Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.
-- From: Norbert Lindenberg ecmascr...@norbertlindenberg.com It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity: - what are valid ECMAScript identifiers? - what are valid BCP 47 language tags? - what are the characters allowed in a certain protocol? - what are the characters that my browser can render? Thanks, Norbert On Mar 24, 2012, at 12:12 , David Herman wrote: On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote: Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray(). No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics. Dave ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
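The two conversion policies discussed above (throw on ill-formed input, as encodeURIComponent does, versus substitute U+FFFD, as Norbert suggests) can be compared in a small sketch. It assumes a runtime that provides the TextEncoder API from the Encoding Standard, which was not part of this discussion; the variable names are illustrative only.

```javascript
// A string with a lone (unpaired) high surrogate: not well-formed UTF-16.
const lone = "a\uD834b";

// Policy 1: encodeURIComponent refuses ill-formed input with a URIError.
let threw = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}

// Policy 2: TextEncoder (an assumption: available as a global in the
// runtime) replaces the unpaired surrogate with U+FFFD, i.e. EF BF BD.
const bytes = Array.from(new TextEncoder().encode(lone));

// Well-formed input converts the same way under both policies.
// U+1D307 is F0 9D 8C 87 in UTF-8.
const encoded = encodeURIComponent("\uD834\uDF07");

console.log(threw, bytes.map((b) => b.toString(16)), encoded);
```

Which policy is appropriate depends on whether the caller would rather lose data silently or handle an exception.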
Re: Full Unicode based on UTF-16 proposal
Steven Levithan wrote: [snip] * /\u{10}/ eq /u{10}/ (literal u repeated 10 times). A point in favour of \U over \u{x...} as a representation of character escapes? -- to avoid ambiguity in regexps.
Re: Full Unicode based on UTF-16 proposal
OK, I guess we have to have Unicode code point escapes :-) I'd expect them to work in identifiers, string literals, and regular expressions (possibly with restrictions coming out of today's emails), but not in JSON source. Norbert On Mar 26, 2012, at 4:45 , Roger Andrews wrote: The strawman is for source code characters, and says it has no implications for string value encodings (or RegExps). String regexp literal escape sequences are explicitly defined in ES5 sections 7.8.4 7.8.5. Will Strawman style also work in ES6 string regexp literals? Thus making regexp ranges much nicer (see final example below). As well as describing code points that have not yet been defined as characters, character escapes in string literals and regexps are good: 1) control characters don't have glyphs at all, 2) the various space glyphs are not readily distinguishable (same for some dash/minus/line glyphs), 3) breaking/non-breaking versions of characters are not distinguishable, 4) many other glyphs are hard to distinguish (being tiny adjustments in positioning or form detail), 5) some characters are combining -- which makes for a messy and confusing program if you use them raw. If you use the raw non-ASCII characters in a program then you need some means of creating them, preferably via a normal keyboard and in your favourite text editor. All program readers need appropriate fonts installed to fully understand the program, and program maintainers also need a Unicode-capable text editor (potentially including non-BMP support). All links/stores that the program travels over or rests in must be Unicode-capable. Whereas using only ASCII chars to write a program is easy to do and always works no matter how basic your computing/transmission infrastructure. (ASCII chars never get silently mangled in transmission or text editors.) How to represent character escapes in a language. C/C++ has: \xNN8-bit char (U+ - U+00FF) \u 16-bit char (U+ - U+) \U32-bit char (i.e. 
any 21-bit Unicode char) Strawman for source chars has: \u{N...} 8 to 24-bit char (i.e. any 21-bit Unicode char) I'm struggling with how non-BMP escapes would be used in practice in strings regexps -- especially regexp ranges. Will Strawman style be used in string regexp literals? Considering U+1D307 (팇) as an example (where 팇 == \uD834\uDF07). To create the string I like 팇 using escapes in C/C++ you can create a string: I like \U0001D307 if the Strawman style works in strings, in ES6 presumably you say: I like \u{1D307} or do you have to know UTF-16 encoding rules and say: I like \uD834\uDF07 To use U+1D307 (팇) and U+1D356 (퍖) as a range in a regexp, i.e. /[팇-퍖]/ should the programmer write: C/C++ style /[\U0001D307-\U0001D356]/ or will Strawman style work in regexps /[\u{1D307}-\u{1D356}]/ or in UTF-16 with {} grouping /[{\uD834\uDF07}-{\uD834\uDF56}]/ Either C/C++ style or Strawman style escape is readable, natural, doesn't require knowledge of UTF-16 encoding rules, can be created easily with any old keyboard, and won't upset text editors. It's a bit unfriendly to require programmers to know UTF-16 rules just to put a non-BMP character in a string or regexp using an escape. And in a regexp range it looks ugly and confusing. -- From: Norbert Lindenberg There is a strawman for code point escapes: http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code#unicode_escape_sequences Note that for references to specific characters it's usually best to just use the characters directly, as Dave did in 팆팇팈팉팊.match(/[팆-퍖]+/u). Escapes can be useful in cases such as regular expressions where you might have to refer to range limits that aren't actually assigned characters, or in test cases where you might use characters for which your OS doesn't have glyphs yet. Norbert On Mar 25, 2012, at 2:57 , Roger Andrews wrote: Doesn't C/C++ allow non-BMP code points using \U in character literals. 
The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes. Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever. Could you use this to avoid complicated things in RegExps like [{\u\u}-{\u\u}], instead have things like [\U0001-\U0003] -- naturally expressing the characters of interest? The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character. (Sorry if I've missed something in the prior discussion.) -- From: Norbert Lindenberg To: David Herman On Mar 24, 2012, at 12:21 ,
Re: Full Unicode based on UTF-16 proposal
On Mar 26, 2012, at 13:02 , Gavin Barraclough wrote:

Hi Norbert, I really like the direction you're going in, but have one minor concern relating to regular expressions. In your proposal, you currently state:

A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.

I think this makes sense in the context of your original proposal, which seeks to be backwards compatible with existing regular expressions through the range transformations. But I'm concerned that this might prove problematic, and would suggest that if we're going to make unicode regexp match opt-in through a /u flag then instead it may be better to make unpaired surrogates in unicode expressions a syntax error.

That's worth considering. It seems we're more and more moving towards two separate RegExp versions anyway - a legacy version based on code units and with all kinds of quirks, and an all-around-better version based on code points. It means however that you can't easily remove unpaired surrogates by
  str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}")

My concern would be expressions such as: /[\uD800\uDC00\uDC00\uD800]/u Under my reading of the current proposal, this could match any of \uD800\uDC00, \uD800, or \uDC00. Allowing this seems to introduce the concept of precedence to character classes (given an input \uD800\uDC00, should I choose to match \uD800\uDC00 or \uD800?). It may also significantly complicate the implementation of backtracking if we were to allow this (if I have matched \uD800\uDC00, should I step back by one code unit or two?).

I think/hope that my specification is clear: a surrogate pair is always treated as one entity, not as two pieces. If the input is \uD800\uDC00, you match \uD800\uDC00. If you have to backtrack over \uD800\uDC00, you step back two code units.
It also just seems much clearer from a user perspective to say that non-unicode regular expressions match code units, unicode regular expressions match code points - mixing the two seems unhelpful. If opt-in is automatic in modules, programmers will likely want an escape to be able to write non-unicode regular expressions, but I don't think this should warrant an extra flag. I don't think we can automatically change the behaviour of the RegExp constructor (without a u flag being passed), so RegExp("\uD800") should still be available to support non-unicode matching within modules.

Agreed, especially after reading Erik's and your additional emails on this.
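As a sketch of the behaviour Norbert describes, the following uses the /u flag as it was later standardized in ES2015, where unpaired surrogates in a /u pattern are accepted rather than rejected as a syntax error. That the final semantics match this thread's proposal on these points is stated here as background, not taken from the thread itself; variable names are illustrative.

```javascript
const pair = "\uD800\uDC00"; // U+10000 encoded as a surrogate pair
// Lone high surrogate, then a proper pair, then a lone low surrogate.
const mixed = "x\uD800y" + pair + "\uDC00z";

// Code unit mode sees two units; code point mode sees one entity.
const oneUnit = /^.$/.test(pair);   // false
const onePoint = /^.$/u.test(pair); // true

// Gavin's character class: the leading \uD800\uDC00 is read as the single
// code point U+10000, so the pair is matched as a whole (two code units).
const classMatch = pair.match(/[\uD800\uDC00\uDC00\uD800]/u)[0];

// Norbert's replace: only the unpaired surrogates become U+FFFD.
const cleaned = mixed.replace(/[\u{D800}-\u{DFFF}]/gu, "\u{FFFD}");

console.log(oneUnit, onePoint, classMatch.length, cleaned === "x\uFFFDy" + pair + "\uFFFDz");
```

Because the pair is consumed as one code point, the question of character class precedence that Gavin raises does not arise for well-formed input.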
Re: Full Unicode based on UTF-16 proposal
The conformance clause doesn't say anything about the interpretation of (UTF-16) code units as code points. To check conformance with C1, you have to look at how the resulting code points are actually further interpreted. My proposal interprets the resulting code points in the following ways: 1) In regular expressions, they can be used in both patterns and input strings to be matched. They may be compared against other code points, or against character classes, some of which will hopefully soon be defined by Unicode properties. In the case of comparing against other code points, they can't match any code points assigned to abstract characters. In the case of Unicode properties, they'll typically fall into the large bucket of have-nots, along with other unassigned code points or, for example, U+FFFD, unless you ask for their general category. 2) When parsing identifiers, they will not have the ID_Start or ID_Continue properties, so they'll be excluded, just like other unassigned code points or U+FFFD. 3) In case conversion, they won't have upper case or lower case equivalents defined, and remain as is, as would happen for unassigned code points or U+FFFD. I don't think either of these amount to interpretation as abstract characters. I mention U+FFFD because the alternative interpretation of unpaired surrogates would be to replace them with U+FFFD, but that doesn't seem to improve anything. Norbert On Mar 26, 2012, at 15:10 , Glenn Adams wrote: On Mon, Mar 26, 2012 at 2:02 PM, Gavin Barraclough barraclo...@apple.com wrote: I really like the direction you're going in, but have one minor concern relating to regular expressions. In your proposal, you currently state: A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value. 
Just as a reminder, this would be in explicit violation of the Unicode conformance clause C1 unless it can be guaranteed that such a code point will not be interpreted as an abstract character: C1A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. [1] http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf Given that such guarantee is likely impractical, this presents a problem for the above proposed language. ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
Norbert Lindenberg wrote:

The ugly world of web reality... Actually, in V8, Firefox, Safari, and IE, /[\u{1}]/ seems to be the same as /[\\u01{}]/ - it matches "\\u01{}u01". In Opera, it doesn't seem to match anything, but doesn't throw the specified SyntaxError either.

How did you test this? I get consistent results that agree with Erik in IE 9, Firefox 11, Chrome 17, and Safari 5.1:

  '\\u01{}'.match(/[\u{1}]/g); // ['u','0','1','{','}']
  /\u{2}/g.test('uu'); // true

Opera, as you said, returns null and false (tested v11.6 and v10.0).

Do we know of any applications actually relying on these bugs, seeing that browsers don't agree on them?

Minus Opera, browsers do agree on them. Admirably so. And they aren't bugs--they're intentional breaks from ES for backcompat with earlier implementations that were themselves designed for backcompat with older non-ES regex behavior. The RegExp Match Web Reality proposal at http://wiki.ecmascript.org/doku.php?id=harmony:regexp_match_web_reality says to add them to the spec, and Allen has said the web reality proposal should be the top RegExp priority for ES6.

I'd easily believe it's safe enough to change /[\u{n..}]/ because of the four-part sequence involved in \u + { + n.. + } that is fairly unlikely to appear in that specific order in a character class. But I'd have a harder time believing /\u{n..}/ is safe to change. It would of course be great to have some real data on the risks/damage.

For string literals, I see that most implementations correctly throw a SyntaxError when given "\u{10}". The exception here is V8. I'm sure it would be safer to allow \u{n..} for string literals even if this fortunate SyntaxError wasn't thrown. Users haven't been trained to think of escaped nonmetacharacters as safe for string literals to the extent that they have for regexes, and you can't programmatically generate such escapes so easily as when passing to the RegExp constructor.
-- Steven Levithan
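The ambiguity Steven describes for /\u{n..}/ can be checked in a current engine. This is a sketch assuming an ES2015 or later runtime with the web-compatible (Annex B) regexp grammar, where the compatibility risk was handled by tying the new meaning to the /u flag; variable names are illustrative.

```javascript
// Without /u: "\u" is an identity escape for the letter "u", and {2} is
// an ordinary quantifier, so the pattern means "uu".
const legacy = /\u{2}/;

// With /u: \u{2} is a code point escape for U+0002.
const unicode = /\u{2}/u;

const legacyMatchesUU = legacy.test("uu");         // true
const unicodeMatchesUU = unicode.test("uu");       // false
const unicodeMatchesCtl = unicode.test("\u0002");  // true

// In string literals there is no flag: "\u{10}" is the single code
// point U+0010 in ES2015 and later.
const literal = "\u{10}";

console.log(legacyMatchesUU, unicodeMatchesUU, unicodeMatchesCtl, literal.length, literal.charCodeAt(0));
```

Old patterns without the flag therefore keep their web-reality meaning.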
Re: Full Unicode based on UTF-16 proposal
It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity: - what are valid ECMAScript identifiers? - what are valid BCP 47 language tags? - what are the characters allowed in a certain protocol? - what are the characters that my browser can render? Thanks, Norbert On Mar 24, 2012, at 12:12 , David Herman wrote: On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote: Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray(). No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics. Dave ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
On Mar 24, 2012, at 11:23 PM, Norbert Lindenberg wrote:

On Mar 24, 2012, at 12:21 , David Herman wrote:

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.

Excellent!

Thanks,
Dave
Re: Full Unicode based on UTF-16 proposal
I use something like String.isValid functionality in a transcoder that converts Strings to/from UTF-8, HTML Formdata (MIME type application/x-www-form-urlencoded -- not the same as URI encoding!), and Base64. Admittedly these currently use 'encodeURI' to do the work, or it just drops out naturally when considering UTF-8 sequences. (I considered testing the regexp /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/ against the input string.) Maybe the function is too obscure for general use, although its presence does flag up the surrogate-pair issue to developers.

-- From: Norbert Lindenberg ecmascr...@norbertlindenberg.com

It's easy to provide this function, but in which situations would it be useful? In most cases that I can think of you're interested in far more constrained definitions of validity:
- what are valid ECMAScript identifiers?
- what are valid BCP 47 language tags?
- what are the characters allowed in a certain protocol?
- what are the characters that my browser can render?

Thanks,
Norbert

On Mar 24, 2012, at 12:12 , David Herman wrote:

On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray().

No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics.

Dave
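For illustration, the check Roger proposes could be sketched as below. The function name isWellFormedUTF16 is hypothetical (the thread only names String.isValid), and the regexp assumes the character class bounds are \u0000 and \uFFFF, i.e. every BMP code unit outside the surrogate range.

```javascript
// Hypothetical helper: true if every surrogate in str is part of a
// high/low pair, i.e. str is well-formed UTF-16.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate just checked
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

// The regexp form of the same test, in code unit mode (no /u flag).
const wellFormed = /^(?:[\u0000-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

const samples = ["", "abc", "a\uD834\uDF07", "\uD834", "\uDF07\uD834", "x\uDF07"];
for (const s of samples) {
  console.log(JSON.stringify(s), isWellFormedUTF16(s), wellFormed.test(s));
}
```

The loop and the regexp should agree on every input; the loop version avoids regexp backtracking on very long strings.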
Re: Full Unicode based on UTF-16 proposal
Doesn't C/C++ allow non-BMP code points using \U in character literals. The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes. Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever. Could you use this to avoid complicated things in RegExps like [{\u\u}-{\u\u}], instead have things like [\U0001-\U0003] -- naturally expressing the characters of interest? The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character. (Sorry if I've missed something in the prior discussion.) -- From: Norbert Lindenberg To: David Herman On Mar 24, 2012, at 12:21 , David Herman wrote: [snip] As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally avoids as least three potentially breaking changes (two of which are explicitly mentioned in your proposal): I haven't completely understood this part of the discussion. Looking at /u as a little red switch (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.: js 팆팇팈팉팊.match(/[팆-퍖]+/u) [팆팇팈팉팊] I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points? With /u, that's exactly what happens. 
My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
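Roger's remark that a \U escape "could be mapped internally to two 16-bit UTF-16 codes" refers to the standard UTF-16 surrogate arithmetic. A minimal sketch follows; the function names toSurrogatePair and fromSurrogatePair are hypothetical and not taken from any proposal in this thread.

```javascript
// Map a supplementary code point (U+10000..U+10FFFF) to its UTF-16
// surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;       // 20 significant bits
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

// Inverse mapping: rebuild the code point from a surrogate pair.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [hi, lo] = toSurrogatePair(0x1D307);
console.log(hi.toString(16), lo.toString(16)); // d834 df07
console.log(fromSurrogatePair(0xD834, 0xDF56).toString(16)); // 1d356
```

With an escape form that takes the code point directly, this arithmetic is done by the language implementation and not by the programmer.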
Re: Full Unicode based on UTF-16 proposal
Just confirmed C/C++ do allow \U escaped characters for non-BMP code points in string literals. Interesting page at:
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/topic/com.ibm.vacpp7a.doc/language/ref/clrc02unicode_standard.htm

So C/C++ has:
  \xNN        8-bit character (U+0000 - U+00FF)
  \uNNNN      16-bit character
  \UNNNNNNNN  32-bit character

This naturally expresses any character, without worrying about the UTF-16 or whatever encoding.

-- From: Roger Andrews To: Norbert Lindenberg

Doesn't C/C++ allow non-BMP code points using \U in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes. Then the programmer can describe exactly the required characters without caring about their coding in UTF-16 or whatever. Could you use this to avoid complicated things in RegExps like [{\u\u}-{\u\u}], instead have things like [\U0001-\U0003] -- naturally expressing the characters of interest? The same goes for String literals, where the programmer does not really care about the encoding, just specifying the character. (Sorry if I've missed something in the prior discussion.)

-- From: Norbert Lindenberg To: David Herman

On Mar 24, 2012, at 12:21 , David Herman wrote:

[snip] As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion.
Looking at /u as a little red switch (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source of the regexp literal, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

With /u, that's exactly what happens. My first proposal was to make this happen even without a new flag, i.e., make "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/) work based on code points, and Steve is arguing against that because of compatibility risk. My proposal also includes some transformations to keep existing regular expressions working, and Steve correctly observes that if we have a flag for code point mode, then the transformation is not needed - old regular expressions would continue to work in code unit mode, while new regular expressions with /u get code point treatment.
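For comparison with the C/C++ \U form, here is how the strawman-style \u{...} escape behaves in engines implementing ES2015 or later, where it was eventually adopted for string literals and for regexps with the /u flag. This is a sketch written after the fact, not something the participants could run at the time; variable names are illustrative.

```javascript
// The code point escape and the explicit surrogate pair give the same string.
const byCodePoint = "I like \u{1D307}";
const bySurrogates = "I like \uD834\uDF07";
const sameString = byCodePoint === bySurrogates; // true

// A range over supplementary code points needs code point mode (/u).
const range = /[\u{1D307}-\u{1D356}]/u;
const inRange = range.test("\u{1D310}");   // true
const belowRange = range.test("\u{1D306}"); // false

// Without /u, the surrogate form is read code unit by code unit, so the
// class contains the range \uDF07-\uD834, which is out of order.
let legacyError = null;
try {
  new RegExp("[\\uD834\\uDF07-\\uD834\\uDF56]");
} catch (e) {
  legacyError = e;
}

console.log(sameString, inRange, belowRange, legacyError instanceof SyntaxError);
```

The last case shows why writing the range with raw UTF-16 escapes is error-prone in code unit mode.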
Re: Full Unicode based on UTF-16 proposal
On Mar 23, 2012, at 11:45 AM, Roger Andrews wrote:

Concerning UTF-16 surrogate pairs, how about a function like: String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray().

No need for it to be a class method, since it only operates on strings. We could simply have String.prototype.isValid(). Note that it would work for primitive strings as well, thanks to JS's automatic promotion semantics.

Dave
Re: Full Unicode based on UTF-16 proposal
On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

+all my internet points

Now you're talking!!

1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.
2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.
3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

This is really exciting.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

I haven't completely understood this part of the discussion. Looking at /u as a little red switch (LRS), i.e., an opportunity to make judicious breaks with compatibility, could we not allow character classes with unescaped non-BMP code points, e.g.:

js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm guessing that I'm missing a reason why that wouldn't work... Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs. Can we not recognize surrogate pairs in character classes within a /u regexp and interpret them as code points?

Dave
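Steven's three items can be probed in an ES2015 or later engine, with the caveat that the flag as finally standardized adopted items 1 and 3 but left \d and \w ASCII-based. That outcome is background knowledge, not something decided in this thread. The Kelvin sign is used for item 3 because the Greek example also matches without /u; variable names are illustrative.

```javascript
// Item 1: code point mode. U+1D306 occupies two UTF-16 code units.
const tetragram = "\u{1D306}";
const dotWithoutU = /^.$/.test(tetragram); // false: "." consumes one code unit
const dotWithU = /^.$/u.test(tetragram);   // true: "." consumes one code point

// Dave's example, written with escapes: a class range over tetragrams.
const run = "\u{1D306}\u{1D307}\u{1D308}";
const matched = run.match(/[\u{1D306}-\u{1D356}]+/u)[0];

// Item 3: /iu uses Unicode case folding. KELVIN SIGN (U+212A) folds to "k".
const kelvinWithoutU = /\u212A/i.test("k"); // false
const kelvinWithU = /\u212A/iu.test("k");   // true

// Item 2 as standardized: \d stays ASCII-only even with /u.
// U+0663 is ARABIC-INDIC DIGIT THREE.
const arabicDigit = /^\d$/u.test("\u0663"); // false

console.log(dotWithoutU, dotWithU, matched === run, kelvinWithoutU, kelvinWithU, arabicDigit);
```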
Re: Full Unicode based on UTF-16 proposal
Presumably the JS source, as a sequence of UTF-16 code units, represents the tetragram code points as surrogate pairs.

Clarification: the JS source *of the regexp literal*.

Dave
Re: Full Unicode based on UTF-16 proposal
On 24 March 2012 17:22, David Herman dher...@mozilla.com wrote: I'm not 100% clear on this point yet, but e.g. the SourceCharacter production in Annex A.1 is described as any Unicode code unit. Ugh, IMHO, that's wrong, and should be any Unicode code point. (let the flames begin?) The underlying transport format should not be a concern for the JS lexer. eval Eval is a red herring: its input is defined as the contents of the given String. So, we come full-circle back to what's in a String?. I'm still partial to Brendan's BRS idea, because at least it fixes everything all at once. Wes -- Wesley W. Garland Director, Product Development PageMail, Inc. +1 613 542 2787 x 102 ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
On Mar 23, 2012, at 6:30 , Steven Levithan wrote: Norbert Lindenberg wrote: I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/ Cool. From the proposal's Updates section: Indicated that u may not be the actual character for the flag for code point mode in regular expressions, as a u flag has already been proposed for Unicode-aware digit and word character matching. I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around support Unicode better flag: 1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert. 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u. One concern: I think code point based matching should be the default for regex literals within modules (where we know the code is written for Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full Unicode sets for such literals? In the other direction it's clear that using /u for \d\D\w\W\b\B has to imply code point mode. 3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test(στιγμας) == true. We probably should review the complete Unicode Technical Standard #18, Unicode Regular Expressions, and see how we can upgrade RegExp for better Unicode support. Maybe on a separate thread... Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i). 
As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used *far* less often, and more developers would continue to get bitten by code-unit-based processing. Good thinking :-) As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally avoids as least three potentially breaking changes (two of which are explicitly mentioned in your proposal): 1. [S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text. 2. s.match(/^.$/)[0].length can now be 2. I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6. 3. /./g.exec(s) can now increment the regex's lastIndex by 2. -- Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
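The three potentially breaking changes listed above can be observed by comparing code unit mode with code point mode. This is a sketch that assumes an engine with the ES2015 /u flag, which postdates this thread. The sample strings are made up for illustration.

```javascript
// One supplementary code point (U+1D306), stored as two UTF-16 code units.
const s = "\u{1D306}";

// Change 2: a single "." can consume two code units in code point mode.
const legacyMatchesWhole = /^.$/.test(s);         // code unit mode
const unicodeLength = s.match(/^.$/u)[0].length;  // code point mode

// Three code points can span anywhere from 3 to 6 code units.
const mixed = "a\u{1D306}\u{1D307}b";
const legacyThree = /.{3}/.exec(mixed)[0].length;
const unicodeThree = /.{3}/u.exec(mixed)[0].length;

// Change 3: lastIndex advances by 2 past a supplementary character.
const legacyDot = /./g;
legacyDot.exec(s);
const unicodeDot = /./gu;
unicodeDot.exec(s);

console.log(legacyMatchesWhole, unicodeLength);          // false 2
console.log(legacyThree, unicodeThree);                  // 3 5
console.log(legacyDot.lastIndex, unicodeDot.lastIndex);  // 1 2
```

Code that assumes one match position per code unit, for example by indexing with lastIndex arithmetic, is what would break if code point mode were not opt-in.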
Re: Full Unicode based on UTF-16 proposal
On Mar 23, 2012, at 7:12 , Lasse Reichstein wrote:

On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan steves_l...@hotmail.com wrote:

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag: ... 3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character.

/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

I think a compliant implementation should (read: ought to) already get that example, since "στιγμας".toUpperCase() == "ΣΤΙΓΜΑΣ".toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway.

According to the ES5 spec, /ΣΤΙΓΜΑΣ/i.test("στιγμας") must be true indeed. Chrome and Node (i.e., V8) and IE get this right; Safari, Firefox, and Opera don't.

Note that toUpperCase allows mappings from 1 to multiple code units, while RegExp canonicalization in ES5 doesn't, so /SS/i.test("ß") === false even though "SS".toUpperCase() === "ß".toUpperCase().

Norbert
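The case-insensitive behaviors described above can be checked directly. The sketch below assumes an engine that follows the ES5 canonicalization rules for /i and also supports the later ES2015 /iu combination. The Kelvin sign example (U+212A) is an added illustration of the "ASCII with non-ASCII" limitation and is not taken from the thread.

```javascript
// ES5 canonicalization maps each character through toUpperCase when the
// result is a single code unit, so final sigma still matches capital sigma.
const greek = /ΣΤΙΓΜΑΣ/i.test("στιγμας");

// "ß".toUpperCase() is "SS" (two code units), so canonicalization leaves
// "ß" unchanged and the pattern does not match.
const sharpS = /SS/i.test("ß");
const sameUpper = "SS".toUpperCase() === "ß".toUpperCase();

// U+212A KELVIN SIGN versus ASCII "k": the ES5 rules never relate a
// non-ASCII character to an ASCII one, while case folding under /iu does.
const kelvinLegacy = /\u212A/i.test("k");
const kelvinUnicode = /\u212A/iu.test("k");

console.log(greek, sharpS, sameUpper);     // true false true
console.log(kelvinLegacy, kelvinUnicode);  // false true
```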
Re: Full Unicode based on UTF-16 proposal
Thanks for the detailed comments! Replies below. Norbert On Mar 23, 2012, at 9:46 , Phillips, Addison wrote: Comments follow. 1. Definition of string. You say: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence. -- I know what you mean, but others might not. Perhaps: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates). -- I can add a note that ill-formed here means containing unpaired surrogates. If I read chapter 3 of the Unicode Standard correctly, there's no other way for UTF-16 to be ill-formed. UTF-16 code units by themselves cannot be invalid - any 16-bit value can occur in a well-formed UTF-16 string. 2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs. Makes sense. 3. Under text interpretation you say: -- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters). -- This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters. The text is about surrogate code points, not about surrogate code units. 4. 0xFFFE and 0x are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-). 5. Editorial unnecessary ;-): -- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters. -- 6. Under 'details' you suggest a number of renamings. 
Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note. Unfortunately, the term character is poisoned in ES5 by a redefinition as code unit (chapter 6). For ES6, I'd like the spec to be really clear where it means code units and where it means code points. Maybe we can then reintroduce character in ES7... 7. Skipping down a lot, to section 6 source text, you propose: -- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15. -- I think this should be removed or modified. This sentence is essentially copied from ES5 (with corrected references), and as I copied it, I made a note to myself that we need to discuss normalization, just not as part of this proposal... Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps: -- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding. -- 8. In 7.6 Identifier Names and Identifiers you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier. I can add a note about surrogate code points and non-characters, but, as you say, they are already ruled out because they can't have the required Unicode properties ID_Start or ID_Continue. The use of ZWJ and ZWNJ is unchanged from ES5. UAX 31 has much stricter rules on where they would be allowed, but I'm not sure we have a strong case for changing the rules in ECMAScript. http://www.unicode.org/reports/tr31/tr31-9.html#Layout_and_Format_Control_Characters 9. 
15.5.4.6: you say (a nonnegative integer less than 0x10), whereas it should say: (a nonnegative integer less than or equal to 0x10) Will fix. 10. In the section on what about utf-32, you say: and the code points start at positions 1, 2, 3.. Of course this should be ... and the code points start at positions 0, 1, 2. Of course. Thanks for this proposal! Addison -Original Message- From: Norbert Lindenberg [mailto:ecmascr...@norbertlindenberg.com] Sent: Thursday, March 22, 2012 10:14 PM To: es-discuss@mozilla.org Subject: Re: Full Unicode based on UTF-16 proposal I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/ Norbert On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote: Based on my prioritization of goals for support for full Unicode
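Comment 7 above, on Normalization Form C, can be illustrated with String.prototype.normalize. That method is an assumption here: it was standardized in ES2015, after this exchange. The sketch shows that NFC changes both the code units and the length of a string, which is one reason applying it automatically can affect processing.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
const decomposed = "e\u0301";
// NFC composes the pair into the single code point U+00E9.
const composed = decomposed.normalize("NFC");

console.log(decomposed.length, composed.length);  // 2 1
console.log(composed === "\u00E9");               // true
console.log(decomposed === composed);             // false: no implicit normalization

// U+212B ANGSTROM SIGN is replaced by U+00C5 under NFC, so the original
// code point does not survive normalization.
const angstrom = "\u212B".normalize("NFC");
console.log(angstrom === "\u00C5");               // true
```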
Re: Full Unicode based on UTF-16 proposal
Norbert Lindenberg wrote: I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/ Cool. From the proposal's Updates section: Indicated that u may not be the actual character for the flag for code point mode in regular expressions, as a u flag has already been proposed for Unicode-aware digit and word character matching. I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around support Unicode better flag: 1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert. 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u. 3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test(στιγμας) == true. Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i). As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used *far* less often, and more developers would continue to get bitten by code-unit-based processing. 
As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \u[\u-\u] to [{\u\u}-{\u\u}] and [\u-\u][\uDC00-\uDFFF] to [{\u\uDC00}-{\u\uDFFF}], and additionally avoids as least three potentially breaking changes (two of which are explicitly mentioned in your proposal): 1. [S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text. 2. s.match(/^.$/)[0].length can now be 2. I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6. 3. /./g.exec(s) can now increment the regex's lastIndex by 2. -- Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
On Fri, Mar 23, 2012 at 2:30 PM, Steven Levithan steves_l...@hotmail.com wrote: I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around support Unicode better flag: ... 3. [New proposal] Makes /i use Unicode casefolding rules. Yey, I'm for it :) Especially if it means dropping the rather naïve canonicalize function that can't canonicalize an ASCII character with a non-ASCII character. /ΣΤΙΓΜΑΣ/iu.test(στιγμας) == true. I think a compliant implementation should (read: ought to) already get that example, since στιγμας.toUpperCase() == ΣΤΙΓΜΑΣ.toUpperCase() in the browsers I have checked, and the ignore-case canonicalization is based on toUpperCase. Alas, most of the implementations miss it anyway. /L ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
RE: Full Unicode based on UTF-16 proposal
Comments follow. 1. Definition of string. You say: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence. -- I know what you mean, but others might not. Perhaps: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates). -- 2. In this section, I would define string after code unit and code point. I would also include a definition of surrogates/surrogate pairs. 3. Under text interpretation you say: -- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters). -- This would (see above) benefit from having a definition in place. As noted, this is slightly incomplete, since surrogate code units are used to form supplementary characters. Perhaps: -- For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which do not individually represent characters). -- 4. 0xFFFE and 0x are non-characters in Unicode. I do think you do the right thing here. It's just a nit that you never note this ;-). 5. Editorial unnecessary ;-): -- This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters. -- 6. Under 'details' you suggest a number of renamings. Are these strictly necessary? The term 'character' could be taken to mean 'code point' instead, with an explanatory note. 7. 
Skipping down a lot, to section 6 source text, you propose: -- The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15. -- I think this should be removed or modified. Automatic application of NFC is not always desirable, as it can affect presentation or processing. Perhaps: -- Normalization of the text to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15, is recommended when transcoding from another character encoding. -- 8. In 7.6 Identifier Names and Identifiers you don't actually forbid unpaired surrogates or non-characters in the text (Identifier_Part:: does this by implication). Perhaps state it? Also, ZWJ and ZWNJ are permitted as the last character in an identifier. 9. 15.5.4.6: you say (a nonnegative integer less than 0x10), whereas it should say: (a nonnegative integer less than or equal to 0x10) 10. In the section on what about utf-32, you say: and the code points start at positions 1, 2, 3.. Of course this should be ... and the code points start at positions 0, 1, 2. Thanks for this proposal! Addison -Original Message- From: Norbert Lindenberg [mailto:ecmascr...@norbertlindenberg.com] Sent: Thursday, March 22, 2012 10:14 PM To: es-discuss@mozilla.org Subject: Re: Full Unicode based on UTF-16 proposal I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. 
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/ Norbert On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote: Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: http://norbertlindenberg.com/2012/03/ecmascript-supplementary- characters/index.html The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing. Comments? Thanks, Norbert [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
Concerning UTF-16 surrogate pairs, how about a function like String.isValid( str ) to discover whether surrogates are used correctly in 'str'? Something like Array.isArray().

N.B.: encodeURI already throws a URIError exception if 'str' is not a well-formed UTF-16 string.

- 1. Definition of string. You say: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence. -- I know what you mean, but others might not. Perhaps: -- However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so the sequence of code units may contain code units that are not valid in Unicode or sequences that do not represent Unicode code points (such as unpaired surrogates). --
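A check along the lines of the suggested String.isValid could look roughly like the sketch below. The function name isWellFormedUTF16 is hypothetical and not part of any proposal in this thread. It only tests that every surrogate code unit is part of a high/low pair, which is the sense of "well-formed" used above.

```javascript
// Hypothetical helper: true if every surrogate code unit in str belongs
// to a high-surrogate/low-surrogate pair.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      if (i + 1 >= str.length) return false;
      const next = str.charCodeAt(i + 1);
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // A low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUTF16("a\uD834\uDF06b"));  // true
console.log(isWellFormedUTF16("a\uD834b"));        // false
console.log(isWellFormedUTF16("\uDF06\uD834"));    // false

// encodeURI gives the same answer by throwing on unpaired surrogates.
let uriError = null;
try {
  encodeURI("a\uD834b");
} catch (e) {
  uriError = e;
}
console.log(uriError instanceof URIError);          // true
```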
Re: Full Unicode based on UTF-16 proposal
I've updated the proposal based on the feedback received so far. Changes are listed in the Updates section. http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/ Norbert On Mar 16, 2012, at 0:18 , Norbert Lindenberg wrote: Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing. Comments? Thanks, Norbert [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
Steven Levithan wrote:

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

Although some regex libraries indeed implement the above, I've just looked over UTS#18 Annex C [1], which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear on whether the differences from \p{L} are fully covered by the inclusion of \p{M} in the above character class. I'm sure there are plenty of people here with greater Unicode expertise than me who could clarify, though.

-- Steven Levithan

[1]: http://unicode.org/reports/tr18/#Compatibility_Properties
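The two candidate definitions of \w can be compared in an engine with Unicode property escapes. That is an assumption: \p{..} only became available in ECMAScript with ES2018 and the /u flag. The sketch suggests that \p{M} alone does not cover the gap between \p{L} and \p{Alphabetic}, because letter numbers (category Nl) such as U+2167 are Alphabetic but are neither letters nor marks.

```javascript
// The simpler set first mentioned in the thread.
const simpleWord = /^[\p{L}\p{Nd}_]+$/u;
// The set required by UTS #18 Annex C.
const uts18Word = /^[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]+$/u;

const samples = {
  romanNumeral: "\u2167",  // ROMAN NUMERAL EIGHT, category Nl
  combining: "e\u0301",    // base letter plus combining mark (Mn)
  undertie: "a\u203Fb",    // U+203F UNDERTIE, category Pc
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, simpleWord.test(text), uts18Word.test(text));
}

// U+2167 is Alphabetic but not a mark, so \p{M} does not account for it.
console.log(/\p{M}/u.test(samples.romanNumeral));
```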
Re: Full Unicode based on UTF-16 proposal
Java SE 7 apparently added flag (?U) to do the same thing as Python's (?u). The new flag also affects Java's POSIX character class definitions such as \p{Alnum}. Note the difference in casing, and also that Java's (?U)\w follows UTS#18, unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for Unicode-aware case folding.

-- Steven Levithan

-Original Message-
From: Steven L.
Sent: Monday, March 19, 2012 12:21 PM
To: Erik Corry
Cc: es-discuss@mozilla.org
Subject: Re: Full Unicode based on UTF-16 proposal

Steven Levithan wrote:

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

Although some regex libraries indeed implement the above, I've just looked over UTS#18 Annex C [1], which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear on whether the differences from \p{L} are fully covered by the inclusion of \p{M} in the above character class. I'm sure there are plenty of people here with greater Unicode expertise than me who could clarify, though.

-- Steven Levithan

[1]: http://unicode.org/reports/tr18/#Compatibility_Properties
Re: Full Unicode based on UTF-16 proposal
Steven Levithan wrote:

* \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default).
* \s == [\x09-\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)).

Oops. My ASCII-only version of \s is obviously missing space \x20 and no-break space \xA0 (which are included in Unicode's \p{Z}).

Erik Corry wrote:

Steven Levithan wrote: [:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing.

This would be pretty useless and is not true in perl. I tried the following:

perl -e "use utf8; print 'æ' =~ /[[:alnum:]]/ . \"\\n\";"

and it prints 1, indicating a match.

***Updating my mental notes*** Roger that. Online docs (including the Perl-specific page you linked to earlier) typically list [:alnum:] as [A-Za-z0-9], but I've just done some quick testing and it seems that regex packages supporting [:alnum:] give it at least three different meanings:

* [A-Za-z0-9]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]
* [\p{Ll}\p{Lu}\p{Lt}\p{Nd}\p{Nl}]

Note that although Java doesn't support POSIX character class syntax, it too supports alnum via \p{Alnum}. Java's alnum matches only [A-Za-z0-9].

Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Erik Corry wrote: OK, I'm convinced that /u should make \d, \b and \w Unicode aware.

w00t!

--Steven Levithan
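The point that \s in ECMAScript is already Unicode-based can be seen by comparing it with an explicit ASCII-only class. In the sketch below the name asciiSpace is illustrative, and the final \p{Z} checks assume an engine with ES2018 property escapes, which did not exist at the time of this thread.

```javascript
// ASCII-only whitespace: tab through carriage return, plus space.
const asciiSpace = /[\x09-\x0D\x20]/;

const noBreakSpace = "\u00A0";
const emSpace = "\u2003";

// ECMAScript's \s already covers non-ASCII space characters.
console.log(/\s/.test(noBreakSpace), asciiSpace.test(noBreakSpace));  // true false
console.log(/\s/.test(emSpace), asciiSpace.test(emSpace));            // true false

// Both the ASCII space and the no-break space are in \p{Z}.
console.log(/\p{Z}/u.test(" "), /\p{Z}/u.test(noBreakSpace));         // true true
```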
Re: Full Unicode based on UTF-16 proposal
2012/3/18 Steven L. steves_l...@hotmail.com: Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :) Please do, they seem quite sensible to me. In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful? -- Erik Corry ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
Erik Corry wrote:

Steven Levithan wrote: Anyway, this is probably all moot, unless someone wants to officially propose POSIX character classes for ES RegExp. ...In which case I'll be happy to state about a half-dozen reasons to not do so. :)

Please do, they seem quite sensible to me.

My main objections are due to the POSIX character class syntax itself, and my preference for introducing Unicode categories using \p{..} instead. But to get down to a little more detail...

* They're backward incompatible. /[[:name:]]/ is currently equivalent to /[\[:aemn]\]/ in web-reality. Granted, this probably won't be a big deal for existing code, but because they're not currently an error, their use could cause latent bugs in old browsers that don't support them and treat them as part of a character class's set.

* They work inside of bracket expressions only. This is clumsy and needlessly confusing. [:alnum:] outside of a bracket expression would probably have to continue to be equivalent to [:almnu], which would lead to at least occasional developer frustration and bugs.

* Since the exact characters they match differ between regex libraries (beyond just Unicode version variation), they would contribute to the existing landscape of regex features that seem to be portable but actually work slightly differently in different places. We need less of this.

* They are either rarely useful or only minor conveniences over existing shorthands, explicit character classes, or Unicode categories that could be matched using \p{..} in more standardized fashion.

* Other implementations, at least, do not allow them to be negated on their own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using them in negated bracket expressions, but that may negate more than you want.

* If ES ever adopts .NET/XPath-style character class subtraction or Java-style character class intersection (the latter was on the cards for ES4), their syntax would become even more confusing.

* Bonus pompous bullet point: IMO, there are more useful and important new RegExp features to focus on, including support for Unicode categories (which, IMO, are regex's new and improved version of POSIX character classes). My personal wishlist would probably include at least 20 new regex features above POSIX character classes, even if they were introduced using the \p{..} syntax (which is how Java included them).

* Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls them character classes, and calls their container a bracket expression. JavaScripters already call the container a character class. (Not an objection, per se. Presumably we could call them something like POSIX shorthands to avoid confusion.)

I'd have no actual objections to adding them using the \p{Name} syntax (as Java does), especially if there is demand for them among regex power-users (you're the first person who I've seen strongly advocate for them). However, I'd still have concerns about exactly which names are added, exactly what they match, and their compatibility with other regex flavors.

In fact \w with Unicode support seems very similar to [:alnum:] to me. If this one is useful are there not other Unicode categories that would be useful?

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

As you said, though, Unicode categories are indeed quite useful. Unicode scripts, too. I'd advocate for them alongside you. Because of how useful they are, I've even made them usable via my XRegExp JavaScript library (see http://git.io/xregexp ). That lib has a relatively small but enthusiastic user base and is seeing increasing use in server-side JS, where the overhead of loading long Unicode code point ranges doesn't matter as much.

But, so long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue that even Unicode categories and scripts are less important than various other features I've mentioned recently on es-discuss, including named capture and atomic groups.

-- Steven Levithan
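The backward-compatibility objection above can be checked in a current engine: without any new flag, POSIX-looking syntax is already a valid pattern with a different meaning. The sketch below compares /[[:alnum:]]/ with its explicit web-reality reading. The final check assumes the ES2015 /u flag, under which a lone "]" is a syntax error. The sample strings are made up.

```javascript
// Parsed today as: a class containing "[", ":", "a", "l", "n", "u", "m",
// followed by a literal "]".
const posixLookalike = /[[:alnum:]]/;
const explicitForm = /[\[:almnu]\]/;

const samples = ["a]", ":]", "[]", "a", "5", "z]"];
const results = samples.map((text) => ({
  text,
  lookalike: posixLookalike.test(text),
  explicit: explicitForm.test(text),
}));
console.log(results);

// In code point mode the trailing "]" is no longer allowed as a literal,
// so the same source is rejected instead of being silently reinterpreted.
let unicodeModeError = null;
try {
  new RegExp("[[:alnum:]]", "u");
} catch (e) {
  unicodeModeError = e;
}
console.log(unicodeModeError instanceof SyntaxError);
```

Note that "5" is not matched at all, so a developer expecting POSIX semantics would get wrong results without any error.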
Re: Full Unicode based on UTF-16 proposal
Eric Corry wrote: However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour. Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let `/./.exec(s)[0].length == 2`. Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent: From Perl and PCRE: \X From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others): \P{M}\p{M}* Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6. Norbert Lindenberg wrote: \u[\u-\u] is interpreted as [\u\u-\u\u] [\u-\u][\u-\u] is interpreted as [\u\u-\u\u] This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters. Yikes! -1! This is unnecessary if the handling of \u is unmodified and support for \u{h..} and/or \x{h..} is added (the latter is the syntax from Perl and PCRE). Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \u alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs. --Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
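The suggestion above, to leave \u alone and add a \u{h..} escape, can be tried in an engine implementing ES2015. That is an assumption, since the syntax was only a proposal at the time. The sketch also shows that without /u the braces already have a meaning (a quantifier on a literal "u"), and that under /u a pair of surrogate escapes is read as one code point.

```javascript
const tetragram = "\u{1D306}";  // stored as \uD834\uDF06

// Code point escape, available only in code point mode.
const codePointEscape = /^\u{1D306}$/u.test(tetragram);

// Without /u, "\u" is a literal "u" and "{3}" is a quantifier.
const legacyBraces = /^\u{3}$/.test("uuu");

// A class containing a surrogate pair escape:
const pairInUnicodeMode = /^[\uD834\uDF06]$/u.test(tetragram);  // one code point
const pairInLegacyMode = /^[\uD834\uDF06]$/.test(tetragram);    // one code unit

console.log(codePointEscape, legacyBraces);         // true true
console.log(pairInUnicodeMode, pairInLegacyMode);   // true false
```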
Re: Full Unicode based on UTF-16 proposal
2012/3/17 Steven L. steves_l...@hotmail.com: Eric Corry wrote: However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour. Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let `/./.exec(s)[0].length == 2`. Care to enlighten us with any thinking behind this disagreeing? Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent: From Perl and PCRE: \X This doesn't work inside []. Were you envisioning the same restriction in JS? Also it matches a grapheme cluster, which is may be useful but is completely different to what the dot does. From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others): \P{M}\p{M}* Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6. Norbert Lindenberg wrote: \u[\u-\u] is interpreted as [\u\u-\u\u] Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units. [\u-\u][\u-\u] is interpreted as [\u\u-\u\u] Norbert, this will have different semantics to the current implementations unless the second range is the full trail surrogate range. I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now. Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \u alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. 
If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs. You seem to be confusing graphemes and unicode code points. Here is the same text 3 times: Four UTF-16 code units: 0x0020 0xD800 0xDF30 0x0308 Three Unicode code points: 0x20 0x10330 0x308 Two Graphemes ¨ -- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up. The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue. -- Erik Corry ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
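The three views of the sample text can be checked with a short sketch. It assumes a modern engine with features standardized after this thread (the string iterator and the `u` flag from ES2015, `\p{...}` property escapes from ES2018), and it approximates graphemes with the `\P{M}\p{M}*` pattern quoted earlier rather than with full grapheme cluster rules.

```javascript
// The sample text from the message: SPACE, GOTHIC LETTER AHSA (U+10330,
// stored as the surrogate pair D800 DF30), COMBINING DIAERESIS (U+0308).
const sample = "\u0020\uD800\uDF30\u0308";

// UTF-16 code units: this is what String.prototype.length counts.
const codeUnits = sample.length;

// Code points: spreading a string walks it code point by code point.
const codePoints = [...sample].map((ch) => ch.codePointAt(0).toString(16));

// Rough graphemes: a non-mark character followed by any combining marks.
const graphemes = sample.match(/\P{M}\p{M}*/gu);

console.log(codeUnits);        // 4
console.log(codePoints);       // [ '20', '10330', '308' ]
console.log(graphemes.length); // 2
```

The same string therefore has length 4, 3, or 2 depending on which unit is counted, which is why the code point proposal and grapheme matching are separate questions.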
Re: Full Unicode based on UTF-16 proposal
Eric Corry wrote: Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let `/./.exec(s)[0].length == 2`. Care to enlighten us with any thinking behind this disagreeing? Sorry for the rushed and overly ebullient message. I disagreed with /u for switching from code unit to code point mode because in the moment I didn't think a code point mode necessary or particularly beneficial. Upon further reflection, I rushed into this opinion and will be more closely examining the related issues. I further objected because I think the /u flag would be better used as a ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). Therefore, I think that if a flag is added that only switches from code unit to code point mode, it should not be u. Presumably, flag /u could simultaneously affect \d\w\b and switch to code point mode. I haven't yet thought enough about combining these two proposals to hold a strong opinion on the matter. there are two ways to match any Unicode grapheme that match existing regex library precedent: From Perl and PCRE: \X This doesn't work inside []. Were you envisioning the same restriction in JS? Also it matches a grapheme cluster, which is may be useful but is completely different to what the dot does. You are of course correct. And yes, I was envisioning the same restriction within character classes. But I'm not a strong proponent of \X, especially if support for Unicode categories is added. I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now. Glad to hear it. You seem to be confusing graphemes and unicode code points. [...] The proposal you are responding to is all about adding Unicode code point handling to regexps. 
It is not about adding grapheme support, which is a rather different issue. Indeed. My response was rushed and poorly formed. My apologies. --Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters. Norbert On Mar 17, 2012, at 11:22 , Steven L. wrote: Eric Corry wrote: Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let `/./.exec(s)[0].length == 2`. Care to enlighten us with any thinking behind this disagreeing? Sorry for the rushed and overly ebullient message. I disagreed with /u for switching from code unit to code point mode because in the moment I didn't think a code point mode necessary or particularly beneficial. Upon further reflection, I rushed into this opinion and will be more closely examining the related issues. I further objected because I think the /u flag would be better used as a ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). Therefore, I think that if a flag is added that only switches from code unit to code point mode, it should not be u. Presumably, flag /u could simultaneously affect \d\w\b and switch to code point mode. I haven't yet thought enough about combining these two proposals to hold a strong opinion on the matter. there are two ways to match any Unicode grapheme that match existing regex library precedent: From Perl and PCRE: \X This doesn't work inside []. Were you envisioning the same restriction in JS? Also it matches a grapheme cluster, which is may be useful but is completely different to what the dot does. You are of course correct. And yes, I was envisioning the same restriction within character classes. 
But I'm not a strong proponent of \X, especially if support for Unicode categories is added. I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now. Glad to hear it. You seem to be confusing graphemes and unicode code points. [...] The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue. Indeed. My response was rushed and poorly formed. My apologies. --Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
Re: Full Unicode based on UTF-16 proposal
2012/3/17 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
> Steven, sorry, I wasn't aware of your proposal for /u when I inserted the note on this flag into my proposal. My proposal was inspired by the use of /u in PHP, where it switches from byte mode to UTF-8 mode. We'll have to see whether it makes sense to combine the two under one flag or use two - fortunately, Unicode still has a few other characters.

/foo/☃ // slash-unicode-snowman for the win! :-)

-- Erik Corry

P.S. I shudder to think what slash-pile-of-poo could mean.
Re: Full Unicode based on UTF-16 proposal
2012/3/17 Steven L. steves_l...@hotmail.com:
> I further objected because I think the /u flag would be better used as an ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES).

I am rather skeptical about treating \d like this. I think a class that matches any digit, including counting rods and Roman numerals but not decimal points or commas (http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals), would be needed much less often than the digits 0-9, so I think hijacking \d for this case is a poor use of name space. The \d escape in Perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful.

And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in Perl would work fine.

\b is a little tougher. The Unicode rewrite would be

    (?:(?<![:alnum:])(?=[:alnum:])|(?<=[:alnum:])(?![:alnum:]))

which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as

    (?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is also more than a little annoying. However, often you don't need that if you have negative lookbehind, because you can write something like

    /(?<!\w)word(?!\w)/  // Negative look-behind for a \w at the start and negative look-ahead for a \w at the end.

which isn't _too_ bad, even if it is much worse than /\bword\b/

> Indeed. My response was rushed and poorly formed. My apologies.

Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven!

-- Erik Corry
Re: Full Unicode based on UTF-16 proposal
On Mar 17, 2012, at 10:20 , Erik Corry wrote: 2012/3/17 Steven L. steves_l...@hotmail.com: Eric Corry wrote: However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour. Disagree with adding /u for this purpose and disagree with breaking backward compatibility to let `/./.exec(s)[0].length == 2`. Care to enlighten us with any thinking behind this disagreeing? Instead, if this is deemed an important enough issue, there are two ways to match any Unicode grapheme that match existing regex library precedent: From Perl and PCRE: \X This doesn't work inside []. Were you envisioning the same restriction in JS? Also it matches a grapheme cluster, which is may be useful but is completely different to what the dot does. From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others): \P{M}\p{M}* Obviously \X is prettier, but because it's fairly rare for people to care about this, IMO the more widely compatible solution that uses Unicode categories is Good Enough if Unicode category syntax is on the table for ES6. Norbert Lindenberg wrote: \u[\u-\u] is interpreted as [\u\u-\u\u] Norbert, this just happens automatically if unmatched surrogates are just treated as if they were normal code units. I don't see how. In the actual matching process, the new design only looks at code points, not code units. Without this transformation, it would see surrogate code points in the pattern, but supplementary code points in the text to be matched. Enhancing the matching process to recognize surrogate code points and insert them into the continuation might work, but wouldn't be any prettier than this transformation. 
[\u-\u][\u-\u] is interpreted as [\u\u-\u\u] Norbert, this will have different semantics to the current implementations unless the second range is the full trail surrogate range. True. I think if we restrict the transformation to that specific case it'll still cover normal usage of this pattern. I agree with Steven that these two cases should just be left alone, which means they will continue to work the way they have until now. Some people will want a way to match arbitrary Unicode code points rather than graphemes anyway, so leaving \u alone lets that use case continue working. This would still allow modifying the handling of literal astral/supplementary characters in RegExps. If it can be handled sensibly, I'm all for treating literal characters in RegExps as discrete graphemes rather than splitting them into surrogate pairs. You seem to be confusing graphemes and unicode code points. Here is the same text 3 times: Four UTF-16 code units: 0x0020 0xD800 0xDF30 0x0308 Three Unicode code points: 0x20 0x10330 0x308 Two Graphemes ¨ -- This is an attempt to show a Gothic Ahsa with an umlaut. My mail program probably screwed it up. Mac Mail is usually Unicode-friendly, so let's try again: ̰̈ The proposal you are responding to is all about adding Unicode code point handling to regexps. It is not about adding grapheme support, which is a rather different issue. Correct - thanks for the explanation! ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
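The difference between code unit matching and code point matching can be illustrated with the `u` flag as it was later standardized in ES2015. That standardized behaviour is an assumption here: it postdates this thread and may differ in detail from the transformation rules under discussion.

```javascript
const ahsa = "\uD800\uDF30"; // U+10330 GOTHIC LETTER AHSA as a surrogate pair

// Code unit mode (no flag): the dot consumes a single code unit.
const unitMatch = /./.exec(ahsa)[0].length;

// Code point mode (u flag): the dot consumes the whole surrogate pair.
const pointMatch = /./u.exec(ahsa)[0].length;

// A class range over supplementary characters is compared by code point.
const inGothicRange = /^[\u{10330}-\u{1034A}]$/u.test(ahsa);

// An unpaired surrogate in the pattern matches an unpaired surrogate in the
// input, but not one half of a well-formed pair.
const loneMatchesLone = /^\uD800$/u.test("\uD800");
const loneMatchesPair = /\uD800/u.test(ahsa);

console.log(unitMatch, pointMatch);            // 1 2
console.log(inGothicRange);                    // true
console.log(loneMatchesLone, loneMatchesPair); // true false
```

This is the `/./.exec(s)[0].length == 2` behaviour debated above, made opt-in through the flag.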
Re: Full Unicode based on UTF-16 proposal
On Mar 17, 2012, at 11:58 , Erik Corry wrote: 2012/3/17 Steven L. steves_l...@hotmail.com: I further objected because I think the /u flag would be better used as a ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). I am rather skeptical about treating \d like this. I think any digit including rods and roman characters but not decimal points/commas http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful. Looking at that page, it seems \d gives you a reasonable set of digits, the ones in the Unicode general category Nd (number, decimal). These digits come from a variety of writing systems, but are all used decimal-positional, so you can parse at least integers using them with a fairly generic algorithm. Dealing with roman numerals or counting rods requires specialized algorithms, so you probably don't want to find them in this bucket. Norbert ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
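The "fairly generic algorithm" for decimal-positional digits might look like the sketch below. The `ZEROS` table, `digitValue`, and `parseDecimal` are hypothetical names introduced for illustration, and the table covers only three scripts; a real implementation would take every Nd range from the Unicode Character Database. The `\p{Nd}` escape assumes an ES2018 engine.

```javascript
// Hypothetical table: code point of digit zero for a few scripts.
const ZEROS = [
  0x0030, // ASCII digits
  0x0660, // Arabic-Indic digits
  0x0966, // Devanagari digits
];

// Returns the value 0-9 of a decimal digit code point, or -1 if the digit
// is not covered by the table above.
function digitValue(cp) {
  for (const zero of ZEROS) {
    if (cp >= zero && cp <= zero + 9) return cp - zero;
  }
  return -1;
}

// Parses a non-negative integer written with Nd digits; returns NaN for
// anything else, including Roman numerals (general category Nl, not Nd).
function parseDecimal(text) {
  if (!/^\p{Nd}+$/u.test(text)) return NaN;
  let value = 0;
  for (const ch of text) {
    const d = digitValue(ch.codePointAt(0));
    if (d < 0) return NaN;
    value = value * 10 + d;
  }
  return value;
}

console.log(parseDecimal("42"));           // 42
console.log(parseDecimal("\u0664\u0662")); // 42 (Arabic-Indic four, two)
console.log(parseDecimal("\u096A\u0968")); // 42 (Devanagari four, two)
console.log(parseDecimal("\u2167"));       // NaN (ROMAN NUMERAL EIGHT)
```

The positional loop is the same for every script; only the lookup of a digit's value needs Unicode data, which is the point made above about Nd being a reasonable bucket.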
Re: Full Unicode based on UTF-16 proposal
Eric Corry wrote: I further objected because I think the /u flag would be better used as a ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). I am rather skeptical about treating \d like this. I think any digit including rods and roman characters but not decimal points/commas http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful. I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want. For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the list). And as Norbert noted, that is in fact what Perl's \d matches. Comparison with other regex flavors: * \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default). * \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)). * \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default). * \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)). 
* \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default). * \d == \p{Nd} -- .NET, Perl, Python (with (?u)). * \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default). * \s == [\x09–\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)). Note that Java's \w and \b are inconsistent. Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true. Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}. And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine. [:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names. \b is a little tougher. The Unicode rewrite would be (?:(?![:alnum:])(?=[:alnum:])|(?=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?!\w)(?=\w)|(?=\w)(?!\w)) which is also more than a little annoying. 
However, often you don't need that if you have negative lookbehind because you can write something like /(?!\w)word(?=!\w)/// Negative look-behind for a \w and negative look-ahead for \w at the end. which isn't _too_ bad, even if it is much worse than /\bword\b/ I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations). Indeed. My response was rushed and poorly formed. My apologies. Gratefully accepted with the hope that my next rushed and poorly formed response will also be forgiven! Consider it done. ;-P --Steven Levithan ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
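The `naïve` example can be reproduced, and a Unicode-aware boundary emulated, with lookarounds. This is only a sketch: it assumes an ES2018 engine (lookbehind and `\p{...}`), and `W` and `UNICODE_B` are hypothetical names built from the `[\p{L}\p{Nd}_]` definition of `\w` listed in the comparison above.

```javascript
// Word characters under the Unicode definition: letters, decimal digits, "_".
const W = "[\\p{L}\\p{Nd}_]";

// A Unicode-aware word boundary spelled out with lookarounds.
const UNICODE_B = `(?:(?<!${W})(?=${W})|(?<=${W})(?!${W}))`;

const word = "na\u00EFve"; // "naïve" with a precomposed ï

// ASCII-only \b sees a boundary between "a" and "ï".
const asciiBoundary = /a\b/.test(word);

// The Unicode-aware boundary does not, because "ï" is a letter.
const unicodeBoundary = new RegExp("a" + UNICODE_B, "u").test(word);

// Whole-word search still works at real word edges.
const wholeWord = new RegExp(UNICODE_B + word + UNICODE_B, "u").test("so " + word + "!");

console.log(asciiBoundary);   // true
console.log(unicodeBoundary); // false
console.log(wholeWord);       // true
```

The length of `UNICODE_B` shows why a built-in Unicode-aware `\b` was being argued for.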
Re: Full Unicode based on UTF-16 proposal
2012/3/18 Steven L. steves_l...@hotmail.com: Eric Corry wrote: I further objected because I think the /u flag would be better used as a ASCII/Unicode mode switcher for \d\w\b. My proposal for this is based on Python's re.UNICODE or (?u) flag, which does the same thing except that it also covers \s (which is already Unicode-based in ES). I am rather skeptical about treating \d like this. I think any digit including rods and roman characters but not decimal points/commas http://en.wikipedia.org/wiki/Numerals_in_Unicode#Counting-rod_numerals would be needed much less often than the digits 0-9, so I think hijacking \d for this case is poor use of name space. The \d escape in perl does not cover other Unicode numerals, and even with the [:name:] syntax there appears to be no way to get the Unicode numerals: http://search.cpan.org/~flora/perl-5.14.2/pod/perlrecharclass.pod#POSIX_Character_Classes This suggests to me that it's not very useful. I know from experience that it's common for Arabic speakers to want to match both 0-9 and Arabic-Indic digits. The same seems true for Hindi/Devanagari digits, and probably others. Even if it wasn't often useful, IMO this change is necessary for congruity with Unicode-enabled \w and \b (I'll get to that), and would likely never be detrimental since /u would be opt-in and it's easy to explicitly use [0-9] when that's what you want. For the record, I am proposing that /\d/u be equivalent to /\p{Nd}/, not /\p{N}/. I.e., it should not match any Unicode number, but rather any Unicode decimal digit (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BNd%7D for the list). And as Norbert noted, that is in fact what Perl's \d matches. Ah, that makes much more sense. Comparison with other regex flavors: * \w == [A-Za-z0-9_] -- ES-current, Java, PCRE, Ruby, Python (default). * \w == [\p{L}\p{Nd}_] -- .NET, Perl, Python (with (?u)). * \b matches between ASCII \w\W -- ES-current, PCRE, Ruby, Python (default). 
* \b matches between Unicode \w\W -- Java, .NET, Perl, Python (with (?u)). * \d == [0-9] -- ES-current, Java, PCRE, Ruby, Python (default). * \d == \p{Nd} -- .NET, Perl, Python (with (?u)). * \s == [\x09-\x0D] -- Java, PCRE, Ruby, Python (default). * \s == [\x09–\x0D\p{Z}] -- ES-current, .NET, Perl, Python (with (?u)). Note that Java's \w and \b are inconsistent. Unicode-based \w and \b are incredibly useful, and it is very common for users to sometimes want them to be Unicode-based--thus, an opt-in flag offers the best of both worlds. In fact, I'd go so far as to say they are broken without Unicode support. Consider, e.g., /a\b/.test('naïve'), which currently returns true. Unicode-based \d would not only help international users/apps, it is also important because otherwise Unicode-based \w and \b would have to use [\p{L}0-9_] rather than [\p{L}\p{Nd}_], which breaks portability with .NET, Perl, Python, and Java. If, conversely, Unicode-enabled \w and \b used [\p{L}\p{Nd}_] but \d used [0-9], then among other consequences (including user confusion), [^\W\d_] could not be used equivalently to \p{L}. And instead of changing the meaning of \w, which will be confusing, I think that [:alnum:] as in perl would work fine. [:alnum:] in Perl, PCRE, Ruby, Tcl, POSIX/GNU BRE/ERE, etc. matches only [A-Za-z0-9]. Making it Unicode-based in ES would be confusing. It also works This would be pretty useless and is not true in perl. I tried the following: perl -e use utf8; print 'æ' =~ /[[:alnum:]]/ . \\n\; and it prints 1, indicating a match. only within character classes. IMO, the POSIX-style [[:name:]] syntax is clumsy and confusing, not to mention backward incompatible. It would potentially also be confusing if ES supports only [:alnum:] without adding the rest of the (not-very-useful) POSIX regex class names. The implication was to add the rest too. 
Seeing things like the regexp at the bottom of this page http://inimino.org/~inimino/blog/javascript_cset is an indication to me that there is a demand. \b is a little tougher. The Unicode rewrite would be (?:(?![:alnum:])(?=[:alnum:])|(?=[:alnum:])(?![:alnum:])) which is obviously too verbose. But if we take \b for this then the ASCII version has to be written as (?:(?!\w)(?=\w)|(?=\w)(?!\w)) which is also more than a little annoying. However, often you don't need that if you have negative lookbehind because you can write something like /(?!\w)word(?=!\w)/ // Negative look-behind for a \w and negative look-ahead for \w at the end. which isn't _too_ bad, even if it is much worse than /\bword\b/ I've already started to explain above why I think Unicode-based \b is important and useful. I'll just add the footnote that relying on lookbehind would in all likelihood perform less efficiently than \b (depending on implementation optimizations). OK, I'm convinced that /u should make \d, \b and \w Unicode aware. I don't think the
Re: Full Unicode based on UTF-16 proposal
This is very useful, and was surely a lot of work. I like the general thrust of it a lot. It has a high level of backwards compatibility, does not rely on the VM having two different string implementations in it, and it seems to fix the issues people are encountering.

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted-in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

2012/3/16 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
> Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>
> The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing. Comments?
>
> Thanks, Norbert
>
> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
Re: Full Unicode based on UTF-16 proposal
Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.

OLD CODE

    for (int i = 0; i < s.length(); ++i) {
        var x = s.charAt(i);
        // do something with x
    }

Using your mechanism, one would write:

NEW CODE

    for (int i = 0; i < s.length(); ++i) {
        var x = s.codePointAt(i);
        // do something with x
        if (x > 0xFFFF) {
            ++i;
        }
    }

In Java, for example, I *really* wish you could write:

DESIRED

    for (int codepoint : s) {
        // do something with codepoint
    }

However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.

Thanks for all your work!

> proposal for upgrading ECMAScript to a Unicode version released in this century

This was amusing; could have said this millennium ;-)

-- Mark
https://plus.google.com/114199149796022210033
Il meglio è l’inimico del bene

On Fri, Mar 16, 2012 at 01:55, Erik Corry erik.co...@gmail.com wrote:
> This is very useful, and was surely a lot of work. I like the general thrust of it a lot. It has a high level of backwards compatibility, does not rely on the VM having two different string implementations in it, and it seems to fix the issues people are encountering.
>
> However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted-in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.
>
> The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.
>
> Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.
>
> If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."
>
> 2012/3/16 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
> > Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
> >
> > The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing. Comments?
> >
> > Thanks, Norbert
> >
> > [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
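For reference, the loop pattern described in this message might be written in ECMAScript as below. This is a minimal sketch assuming `String.prototype.codePointAt` as later standardized in ES2015; `eachCodePoint` is a hypothetical helper name, not part of the proposal.

```javascript
// Calls visit(codePoint, index) once per code point of s, where index is
// the position of the code point's first code unit.
function eachCodePoint(s, visit) {
  for (let i = 0; i < s.length; ++i) {
    const cp = s.codePointAt(i);
    visit(cp, i);
    // A supplementary code point occupies two code units, so step over
    // its trail surrogate.
    if (cp > 0xFFFF) ++i;
  }
}

const seen = [];
eachCodePoint("a\uD800\uDF30b", (cp, i) => seen.push([i, cp.toString(16)]));
console.log(seen); // [ [ 0, '61' ], [ 1, '10330' ], [ 3, '62' ] ]
```

Index 2 never reaches the callback, because it holds the trail surrogate of U+10330.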
Re: Full Unicode based on UTF-16 proposal
On Sat, 17 Mar 2012 00:23:25 +0100, Mark Davis ☕ m...@macchiato.com wrote:
> Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.
>
> OLD CODE
>
>     for (int i = 0; i < s.length(); ++i) {
>         var x = s.charAt(i);
>         // do something with x
>     }
>
> Using your mechanism, one would write:
>
> NEW CODE
>
>     for (int i = 0; i < s.length(); ++i) {
>         var x = s.codePointAt(i);
>         // do something with x
>         if (x > 0xFFFF) {
>             ++i;
>         }
>     }
>
> In Java, for example, I *really* wish you could write:
>
> DESIRED
>
>     for (int codepoint : s) {
>         // do something with codepoint
>     }
>
> However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.

That's the beauty of ECMAScript; it's extensible. :-)

    String.prototype.forEachCodePoint = function(fun) {
        var s = String(this)
        for (var i = 0; i < s.length; i++) {
            var x = s.codePointAt(i)
            fun(x, s)
            if (x > 0xFFFF) { ++i }
        }
    }

    "hello".forEachCodePoint(function(x) {
        // do something with x
    })

> Thanks for all your work!
>
> > proposal for upgrading ECMAScript to a Unicode version released in this century
>
> This was amusing; could have said this millennium ;-)
>
> -- Mark
> https://plus.google.com/114199149796022210033
> Il meglio è l’inimico del bene

Jonas
Re: Full Unicode based on UTF-16 proposal
Thanks for your comments - a few replies below. Norbert On Mar 16, 2012, at 1:55 , Erik Corry wrote: However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour. Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples? Good point about Harmony code, although it seems opt-in got replaced by being part of a module. The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceeding it. NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception. Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp. It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-) If we are makig a /u modifier for RegExp it would also be nice to get rid of the incorrect case independent matching rules. This is the section that says: If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch. And the exception for ß and other characters whose upper case equivalent has more than one code point (If u does not consist of a single character, return ch. in the Canonicalize algorithm in ES 5.1). 
2012/3/16 Norbert Lindenberg ecmascr...@norbertlindenberg.com: Based on my prioritization of goals for support for full Unicode in ECMAScript [1], I've put together a proposal for supporting the full Unicode character set based on the existing representation of text in ECMAScript using UTF-16 code unit sequences: http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html The detailed proposed spec changes serve to get a good idea of the scope of the changes, but will need some polishing. Comments? Thanks, Norbert [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss ___ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss
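The two Canonicalize exceptions mentioned in this exchange can be observed directly. The sketch compares legacy `/i` matching with `/iu` matching as the `u` flag was eventually specified in ES2015 (simple case folding); treating that later behaviour as the outcome of this discussion is an assumption.

```javascript
// U+212A KELVIN SIGN lowercases to "k", but legacy /i compares upper-cased
// code units, so the two never meet.
const legacyKelvin = /\u212A/i.test("k");
const unicodeKelvin = /\u212A/iu.test("k");

// U+017F LATIN SMALL LETTER LONG S upper-cases to ASCII "S". The rule
// "code unit >= 128 mapping to < 128, then return ch" blocks the match.
const legacyLongS = /\u017F/i.test("s");
const unicodeLongS = /\u017F/iu.test("s");

// U+00DF (ß) upper-cases to two characters, so Canonicalize leaves it alone.
const sharpSUpper = "\u00DF".toUpperCase();
const sharpSMatch = /\u00DF/i.test("SS");

console.log(legacyKelvin, unicodeKelvin); // false true
console.log(legacyLongS, unicodeLongS);   // false true
console.log(sharpSUpper, sharpSMatch);    // SS false
```

The ß case stays unmatched in either mode, since one character cannot match two without full case folding.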
Re: Full Unicode based on UTF-16 proposal
In Harmony we should be able to make this even more beautiful using iterators [1]:

If we add:

    String.prototype.[iterator] = function() {
        var s = this;
        return {
            index: 0,
            next: function() {
                if (this.index >= s.length) {
                    throw StopIteration;
                }
                let cp = s.codePointAt(this.index);
                this.index += cp > 0xFFFF ? 2 : 1;
                return cp;
            }
        }
    }

clients can write:

    for (codePoint of str) {
        // do something with codePoint
    }

Norbert

[1] http://wiki.ecmascript.org/doku.php?id=harmony:iterators

On Mar 16, 2012, at 17:04 , Jonas Höglund wrote:
> On Sat, 17 Mar 2012 00:23:25 +0100, Mark Davis ☕ m...@macchiato.com wrote:
> > Whew, a lot of work, Norbert. Looks quite good. My one question is whether it is worth having a mechanism for iteration.
> >
> > OLD CODE
> >
> >     for (int i = 0; i < s.length(); ++i) {
> >         var x = s.charAt(i);
> >         // do something with x
> >     }
> >
> > Using your mechanism, one would write:
> >
> > NEW CODE
> >
> >     for (int i = 0; i < s.length(); ++i) {
> >         var x = s.codePointAt(i);
> >         // do something with x
> >         if (x > 0xFFFF) {
> >             ++i;
> >         }
> >     }
> >
> > In Java, for example, I *really* wish you could write:
> >
> > DESIRED
> >
> >     for (int codepoint : s) {
> >         // do something with codepoint
> >     }
> >
> > However, maybe this kind of iteration is rare enough in ES that it suffices to document the pattern under NEW CODE.
>
> That's the beauty of ECMAScript; it's extensible. :-)
>
>     String.prototype.forEachCodePoint = function(fun) {
>         var s = String(this)
>         for (var i = 0; i < s.length; i++) {
>             var x = s.codePointAt(i)
>             fun(x, s)
>             if (x > 0xFFFF) { ++i }
>         }
>     }
>
>     "hello".forEachCodePoint(function(x) {
>         // do something with x
>     })
>
> > Thanks for all your work!
> >
> > > proposal for upgrading ECMAScript to a Unicode version released in this century
> >
> > This was amusing; could have said this millennium ;-)
> >
> > -- Mark
> > https://plus.google.com/114199149796022210033
> > Il meglio è l’inimico del bene
>
> Jonas
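The iterator idea can be tried today with a generator. This is a sketch using ES2015 syntax, which postdates the Harmony draft referenced in this message; `codePoints` is a hypothetical function name. The second loop relies on the built-in string iterator that ES2015 eventually shipped, which yields one-code-point strings rather than numbers.

```javascript
// Yields each code point of str as a number.
function* codePoints(str) {
  let index = 0;
  while (index < str.length) {
    const cp = str.codePointAt(index);
    // Supplementary code points take two code units, all others take one.
    index += cp > 0xFFFF ? 2 : 1;
    yield cp;
  }
}

const text = "a\uD800\uDF30"; // "a" followed by U+10330

const viaGenerator = [...codePoints(text)];

// The built-in iterator walks the same boundaries.
const viaBuiltin = [];
for (const ch of text) {
  viaBuiltin.push(ch.codePointAt(0));
}

console.log(viaGenerator); // [ 97, 66352 ]
console.log(viaBuiltin);   // [ 97, 66352 ]
```

Neither version needs NaN or an exception to signal the end of the string, since the loop condition handles it.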
Re: Full Unicode based on UTF-16 proposal
2012/3/17 Norbert Lindenberg ecmascr...@norbertlindenberg.com:

Thanks for your comments - a few replies below. Norbert

On Mar 16, 2012, at 1:55 , Erik Corry wrote:

However I think we probably do want the /u modifier on regexps to control the new backward-incompatible behaviour. There may be some way to relax this for regexp literals in opted-in Harmony code, but for new RegExp(...) and for other string literals I think there are rather too many inconsistencies with the old behaviour.

Before asking developers to add /u, we should really have some evidence that not doing so would cause actual compatibility issues for real applications. Do you know of any examples?

No. In general I don't think it is realistic to try to prove that problematic code does not exist, since that requires quantifying over all existing JS code, which is clearly impossible.

Good point about Harmony code, although it seems opt-in got replaced by being part of a module.

That would work too, I think.

The algorithm given for codePointAt never returns NaN. It should probably do that for indices that hit a trail surrogate that has a lead surrogate preceding it.

NaN is not a valid code point, so it shouldn't be returned. If we want to indicate access to a trailing surrogate code unit as an error, we should throw an exception.

Then you should probably remove the text "If there is no code unit at that position, the result is NaN" from your proposal :-) I am wary of using exceptions for non-exceptional data-driven events, since performance is usually terrible and it's arguably an abuse of the mechanism. Your iterator code looks fine to me and needs neither NaN nor exceptions.

Perhaps it is outside the scope of this proposal, but it would also make a lot of sense to add some named character classes to RegExp.

It would make a lot of sense, but is outside the scope of this proposal. One step at a time :-)

I can see that. But if we are going to have multiple versions of the RegExp syntax we should probably aim to keep the number down.

If we are making a /u modifier for RegExp it would also be nice to get rid of the incorrect case-independent matching rules. This is the section that says: "If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch."

And the exception for ß and other characters whose upper case equivalent has more than one code point ("If u does not consist of a single character, return ch." in the Canonicalize algorithm in ES 5.1).

Yes.
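To make the two Canonicalize rules quoted in this message concrete, here is a rough sketch of how they behave for a single UTF-16 code unit. It is an illustration only, not the specification text: the function name canonicalizeES5 is hypothetical, and the sketch relies on the built-in toUpperCase for the case mapping.

```javascript
// Sketch of the ES 5.1 Canonicalize step used for case-insensitive matching.
// canonicalizeES5 is a hypothetical name; ch is assumed to be one code unit.
function canonicalizeES5(ch) {
  const u = ch.toUpperCase();
  // Rule quoted above: if upper-casing yields more than one character, keep ch.
  if (u.length !== 1) return ch;
  // Rule quoted above: a non-ASCII character must not map to an ASCII one.
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  return u;
}

// Print how a few characters canonicalize under these rules.
for (const ch of ["a", "\u00E9", "\u00DF", "\u017F"]) {
  console.log(JSON.stringify(ch), "->", JSON.stringify(canonicalizeES5(ch)));
}
```

Under these rules ß (U+00DF) stays as it is because its upper case form is the two-character "SS", and the long s (U+017F) stays as it is because its upper case form "S" is ASCII. These are the kinds of cases the thread calls incorrect for case-independent matching.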