Working with grapheme clusters

Claude Pache Thu, 24 Oct 2013 07:02:54 -0700

Hello,

You might know that the following ES expressions are broken:


        text.charAt(0) // get the first character of the text
        text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters

The reason is *not* that ES works with UTF-16 code units instead of Unicode 
code points (that is just a red herring!), but
that _graphemes_ (that is, what a human perceives as a "character") may span 
multiple code units and/or code points.
For example, the letter "n̈" ("n" with diaeresis) is encoded using two code 
points, namely 
U+006E (LATIN SMALL LETTER N) and U+0308 (COMBINING DIAERESIS), and two UTF-16 
code units. 
A string should never be cut between these two code points, so that the 
resulting text remains meaningful for humans.
(Nota bene: I have carefully chosen a grapheme that exists in a current written 
language and does not exist 
as a precomposed Unicode character.)
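
A minimal sketch of the example above, using only standard string methods; the 
variable name `nDiaeresis` is just an illustration. It shows that counting code 
points instead of code units does not help, and that normalization cannot 
merge the two code points either:

```javascript
// "n" followed by U+0308 COMBINING DIAERESIS: one grapheme, two code points.
const nDiaeresis = "n\u0308";

console.log(nDiaeresis.length);                  // 2 UTF-16 code units
console.log([...nDiaeresis].length);             // 2 code points
console.log(nDiaeresis.normalize("NFC").length); // still 2: no precomposed form
console.log(nDiaeresis.charAt(0) === "n");       // true: the diaeresis is cut off
```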

The correct technical notion to use here is that of a "grapheme cluster", 
that is, 
a sequence of Unicode code points that represents a grapheme.
See [UAX29] (Unicode Standard Annex #29: Unicode text segmentation), section 3, 
for more info.

Therefore, I propose the following basic operations on grapheme 
clusters:

(1) String.prototype.graphemeAt(pos)
        
This method is similar to `String.prototype.charAt()`, but it returns a 
grapheme cluster instead of a string 
composed of a single UTF-16 code unit. More precisely, it returns the shortest 
substring of `this` 
beginning at position `pos` (inclusively) and ending at position `pos2` 
(exclusively), where `pos2` is the 
smallest position in `this` which is greater than `pos` and which is an 
(extended) grapheme cluster boundary,
according to the specification in [UAX29], section 3.1.
If `pos` is out of bounds, an empty string is returned.
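
The behaviour described above can be approximated today with `Intl.Segmenter`, 
an API that postdates this message and is not part of the proposal. The 
following is only a sketch under that assumption: the standalone function 
`graphemeAt(text, pos)` is a hypothetical stand-in for the proposed method, and 
treating a non-numeric `pos` as 0 is a choice made here, not something the 
proposal specifies.

```javascript
// Assumption: Intl.Segmenter is available (it is not part of the proposal).
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical stand-in for the proposed String.prototype.graphemeAt(pos).
function graphemeAt(text, pos) {
  pos = Math.trunc(pos) || 0;                  // NaN becomes 0 (assumption)
  if (pos < 0 || pos >= text.length) return ""; // out of bounds
  // Find the grapheme cluster that contains code unit index `pos`.
  const cluster = graphemeSegmenter.segment(text).containing(pos);
  // Return everything from `pos` up to the next cluster boundary.
  return text.slice(pos, cluster.index + cluster.segment.length);
}
```

With the string "n̈o" (written `"n\u0308o"`), `graphemeAt(s, 0)` yields the two 
code units of "n̈", while `graphemeAt(s, 1)` starts in the middle of the cluster 
and yields only the combining diaeresis.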

(2) String.prototype.graphemes(start = 0)

This method returns an iterator, enumerating the graphemes of `this`, starting 
at position `start`. 
Given the `String.prototype.graphemeAt` method as above,
it could be approximately expressed in ES6 as follows (ignoring edge cases):

        String.prototype.graphemes = function*(pos = 0) {
                pos = Math.floor(pos)
                if (pos < 0 || Number.isNaN(pos))
                        pos = 0
                while (pos < this.length) {
                        let grapheme = this.graphemeAt(pos)
                        pos += grapheme.length
                        yield grapheme
                }
        }

So, the two examples from the beginning of my message could be correctly 
implemented as follows:

        text.graphemeAt(0) // get the first grapheme of the text

        // shorten a text to its first hundred graphemes
        let shortenedText = ''
        let numGraphemes = 0
        // iterate over graphemes, not over the code points of `text`
        for (let grapheme of text.graphemes()) {
                numGraphemes += 1
                if (numGraphemes > 100) {
                        shortenedText += '…'
                        break
                }
                shortenedText += grapheme
        }
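
For comparison, the same truncation can be sketched in a form that runs today, 
again assuming `Intl.Segmenter` (which postdates this message and is not the 
proposed API). The function name `shortenToGraphemes` and its `maxGraphemes` 
parameter are hypothetical.

```javascript
// Assumption: Intl.Segmenter stands in for the proposed graphemes() iterator.
function shortenToGraphemes(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    // A further grapheme exists beyond the limit: stop and mark the cut.
    if (count === maxGraphemes) return result + "…";
    result += segment;
    count += 1;
  }
  return result; // the text fits within the limit, so it is returned unchanged
}
```

Because the loop advances one grapheme cluster at a time, the cut can never 
fall between "n" and its combining diaeresis.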

As a side note, I wonder whether 
`String.prototype.symbolAt`/`String.prototype.at`, as proposed in a recent thread, 
and `String.prototype[@@iterator]`, as currently specified, are really what 
people need, 
or whether people would mistakenly use them with the intended meaning of 
`String.prototype.graphemeAt`
and `String.prototype.graphemes` as discussed in the present message.

Thoughts?

Claude

[UAX29]: http://www.unicode.org/reports/tr29/ "Unicode Standard Annex #29: 
Unicode text segmentation."

