Re: Working with grapheme clusters

Claude Pache Thu, 24 Oct 2013 08:18:12 -0700

On 24 Oct 2013, at 16:24, Mathias Bynens <[email protected]> wrote:


> 
>>      text.graphemeAt(0) // get the first grapheme of the text
>> 
>>      // shorten a text to its first hundred graphemes
>>      var shortenText = ''
>>      let numGraphemes = 0
>>      for (let grapheme of text) {
>>              numGraphemes += 1
>>              if (numGraphemes > 100) {
>>                      shortenText += '…'
>>                      break
>>              }
>>              shortenText += grapheme
>>      }
> 
> So, you would want to change the string iterator’s behavior too?

At least, I'd like to have the opportunity to iterate over what I need. I have
no opinion on whether iterating through code points or through grapheme
clusters should be the default iterator, or whether there should be no default
at all, forcing the developer to consciously pick the one they really mean.
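
To make the distinction concrete, here is a minimal sketch of what code point iteration gives you. It only relies on the string iterator as currently specified; the sample strings are my own illustration.

```javascript
// 'é' written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// one user-perceived character, but two code points.
const text = 'e\u0301';

// Array.from uses the string iterator, which yields code points.
const codePoints = Array.from(text);
console.log(codePoints.length); // 2

// An astral symbol is one code point but two UTF-16 code units,
// so code point iteration does help there.
const astral = '\uD83D\uDCA9';
console.log(astral.length);             // 2 code units
console.log(Array.from(astral).length); // 1 code point
```

So iterating by code points fixes the surrogate pair case, but a base letter and its combining mark still arrive as separate items.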

> 
>> As a side note, I ask whether the
>> `String.prototype.symbolAt`/`String.prototype.at` as proposed in a recent
>> thread, and the `String.prototype[@@iterator]` as currently specified, are
>> really what people need, or if they would mistakenly use them with the
>> intended meaning of `String.prototype.graphemeAt` and
>> `String.prototype.graphemes` as discussed in the present message?
> 
> I don’t think this would be an issue. The new `String` methods and the 
> iterator are well-defined and documented in terms of *code points*.
> 
> IMHO combining marks are easy enough to match and special-case in your code 
> if that’s what you need. You could use a regular expression to iterate over 
> all grapheme clusters in the string:
> 
>    // Based on the example on
>    // http://mathiasbynens.be/notes/javascript-unicode#accounting-for-other-combining-marks
>    var regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

Note that the specification in [UAX29], section 3.1, for determining grapheme 
cluster boundaries does not just use the notion of "combining marks". I fear 
that, for some exotic scripts (apparently, at least Hangul), it is more 
complicated than just finding a span of combining marks.
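
As a rough illustration of the Hangul case, here is a sketch that runs the regular expression quoted above against a syllable spelled with conjoining jamo. The sample strings are my own choice; the expectation that a leading consonant, vowel, and trailing consonant stay together comes from the Hangul syllable rules in [UAX29].

```javascript
// The regular expression quoted above, written on a single line.
const regexGraphemeCluster = /([\0-\u02FF\u0370-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uDC00-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])([\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]*)/g;

// Base letter + combining mark: matched as one cluster, as intended.
const accented = 'e\u0301';
const accentedMatches = accented.match(regexGraphemeCluster);
console.log(accentedMatches.length); // 1

// Hangul syllable spelled with conjoining jamo (L + V + T):
// U+1100, U+1161, U+11A8. UAX #29 treats this as ONE grapheme cluster.
const hangul = '\u1100\u1161\u11A8';
const hangulMatches = hangul.match(regexGraphemeCluster);
console.log(hangulMatches.length); // 3: each jamo is reported separately
```

The jamo are not combining marks in the ranges the expression checks, so each one is matched as its own "cluster".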

—Claude


[UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: 
Unicode text segmentation."

