The internationalization working group plans to support grapheme clusters 
through its text segmentation API. See the strawman:
http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation

Note that Unicode Standard Annex #29 allows for tailored (language sensitive) 
grapheme clusters, which makes ECMA-402 a better fit than ECMA-262.

Norbert


On Oct 24, 2013, at 7:02, Claude Pache <[email protected]> wrote:

> Hello,
> 
> You might know that the following ES expressions are broken:
> 
>       text.charAt(0) // get the first character of the text
>       // cut the text after 100 characters
>       text.length > 100 ? text.substring(0,100) + '...' : text
> 
> The reason is *not* because ES works with UTF-16 code units instead of 
> Unicode code points (it's just a red herring!), but
> because _graphemes_ (that is, what a human perceives as a "character") may 
> span multiple code units and/or code points.
> For example, the letter "n̈" ("n" with diaeresis) is encoded as two code 
> points, U+006E (LATIN SMALL LETTER N) and U+0308 (COMBINING DIAERESIS), 
> and thus as two UTF-16 code units. A string should not be cut between 
> these two code points if the resulting text is to stay meaningful for humans.
> (Nota bene: I have carefully chosen a grapheme that exists in a current 
> written language and does not exist 
> as precomposed Unicode character.)
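
To make the point concrete, here is a small sketch (the variable names are 
mine, for illustration only). Counting code points instead of code units does 
not help with this letter, and `charAt(0)` returns only the base letter:

```javascript
// "n̈" written with escapes so that both code points are visible.
const text = 'n\u0308';

const codeUnits = text.length;                   // UTF-16 code units
const codePoints = Array.from(text).length;      // Unicode code points
const nfcLength = text.normalize('NFC').length;  // NFC cannot compose this pair
const firstChar = text.charAt(0);                // the base letter only

console.log(codeUnits, codePoints, nfcLength, firstChar === 'n');
// prints: 2 2 2 true
```
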
> 
> The correct technical notion to use here is the "grapheme cluster", that is, 
> a sequence of Unicode code points that represents a grapheme.
> See [UAX29] (Unicode Standard Annex #29: Unicode text segmentation), section 
> 3, for more info.
> 
> Therefore, I propose the following basic operations to operate on grapheme 
> clusters:
> 
> (1) String.prototype.graphemeAt(pos)
>       
> This method is similar to `String.prototype.charAt()`, but it returns a 
> grapheme cluster instead of a string 
> composed of a single UTF-16 code unit. More precisely, it returns the 
> shortest substring of `this` 
> beginning at position `pos` (inclusive) and ending at position `pos2` 
> (exclusive), where `pos2` is the 
> smallest position in `this` which is greater than `pos` and which is an 
> (extended) grapheme cluster boundary,
> according to the specification in [UAX29], section 3.1.
> If `pos` is out of bounds, an empty string is returned.
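
A rough approximation of the `graphemeAt` behaviour described above, written 
as a standalone function rather than as a `String.prototype` method. It 
assumes a runtime that provides `Intl.Segmenter`, an ECMA-402 API that is not 
part of the proposal in this thread. The function name and the argument 
handling are illustrative only, and the default locale's rules are used:

```javascript
// Hypothetical stand-in for the proposed String.prototype.graphemeAt(pos).
// Assumption: Intl.Segmenter is available in the runtime.
function graphemeAt(text, pos) {
  // Coerce to an integer; NaN and undefined become 0.
  pos = Math.trunc(pos) || 0;
  // Out-of-bounds positions yield an empty string, as in the proposal.
  if (pos < 0 || pos >= text.length) return '';

  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // containing(pos) returns the cluster that covers code unit `pos`.
  const cluster = segmenter.segment(text).containing(pos);
  // The next boundary after `pos` is the end of that cluster.
  const pos2 = cluster.index + cluster.segment.length;
  return text.slice(pos, pos2);
}

console.log(graphemeAt('n\u0308o', 0).length); // both code units of "n̈"
```

If `pos` falls inside a cluster, this returns only the remainder of that 
cluster, which matches the "smallest boundary greater than `pos`" wording.
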
> 
> (2) String.prototype.graphemes(start = 0)
> 
> This method returns an iterator, enumerating the graphemes of `this`, 
> starting at position `start`. 
> Given the `String.prototype.graphemeAt` method as above,
> it could be approximately expressed in ES6 as follows (ignoring edge cases):
> 
>       String.prototype.graphemes = function*(pos = 0) {
>               pos = Math.floor(pos)
>               if (pos < 0 || Number.isNaN(pos))
>                       pos = 0
>               while (pos < this.length) {
>                       let grapheme = this.graphemeAt(pos)
>                       pos += grapheme.length
>                       yield grapheme
>               }
>       }
> 
> So, the two examples of the beginning of my message could be correctly 
> implemented as follows:
> 
>       text.graphemeAt(0) // get the first grapheme of the text
> 
>       // shorten a text to its first hundred graphemes
>       var shortenText = ''
>       let numGraphemes = 0
>       for (let grapheme of text.graphemes()) {
>               numGraphemes += 1
>               if (numGraphemes > 100) {
>                       shortenText += '…'
>                       break
>               }
>               shortenText += grapheme
>       }
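
For comparison, a sketch of the same truncation that can be run today, 
assuming a runtime with `Intl.Segmenter` (an ECMA-402 API, not the methods 
proposed in this thread). The function name and the `maxGraphemes` parameter 
are hypothetical:

```javascript
// Shorten a text to at most `maxGraphemes` grapheme clusters and append an
// ellipsis if anything was cut. Assumption: Intl.Segmenter is available.
function shortenText(text, maxGraphemes = 100) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    if (count === maxGraphemes) {
      // One more cluster starts at `index`, so cut exactly at that boundary.
      return text.slice(0, index) + '…';
    }
    count += 1;
  }
  // The text has no more than maxGraphemes clusters; return it unchanged.
  return text;
}

console.log(shortenText('n\u0308n\u0308n\u0308', 2));
```

Cutting at the `index` of the first surplus cluster avoids building the 
result by repeated string concatenation.
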
> 
> As a side note, I wonder whether `String.prototype.symbolAt`/
> `String.prototype.at`, as proposed in a recent thread, 
> and `String.prototype[@@iterator]`, as currently specified, are really 
> what people need, 
> or whether people would mistakenly use them with the intended meaning of 
> `String.prototype.graphemeAt`
> and `String.prototype.graphemes` as discussed in the present message.
> 
> Thoughts?
> 
> Claude
> 
> [UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: 
> Unicode text segmentation."
> 
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss
