On 26/03/2024 21:14, Casper Langemeijer wrote:
> If you need someone to help for the grapheme_ marketing team, let me know.

I think a big part of the problem is that very few people dig into the complexities of text encoding, and so don't know that a "grapheme" is what they're looking for.

Unicode documentation is, generally, very careful with its terminology, distinguishing between "code points", "code units", "graphemes", "grapheme clusters", "glyphs", etc. Pretty much everyone else just says "character", and assumes that everyone knows what they mean.


As a case in point, looking at the PHP manual pages for strlen, mb_strlen, and grapheme_strlen:

Short summary:

- strlen — Get string length
- mb_strlen — Get string length
- grapheme_strlen — Get string length in grapheme units

Description:

- Returns the length of the given string.
- Gets the length of a string.
- Get string length in grapheme units (not bytes or characters)


The first two don't actually say what units they're measuring in. Maybe it's millimetres? ;)

The last one uses the term "grapheme" without explaining what it means, and makes a contrast with "characters", which is confusing, as one of the definitions in the Unicode glossary [https://unicode.org/glossary/#grapheme] is:

> What a user thinks of as a character.


The mb_strlen documentation has a bit more explanation in its Return Values section:

> Returns the number of characters in string string having character encoding encoding. A multi-byte character is counted as 1.

For Unicode in particular, this is a poor description; it is completely missing the term "code point", which is what it actually counts.
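
To make the three units concrete, here is a minimal sketch. The example string is my own, not taken from the manual, and it assumes both ext/mbstring and ext/intl are loaded. It measures the same string, an "e" followed by U+0301 COMBINING ACUTE ACCENT, with each function:

```php
<?php
// Assumption: ext/mbstring and ext/intl are both available.
// "e" + U+0301 COMBINING ACUTE ACCENT; most users would call this one character.
$s = "e\u{0301}";

var_dump(strlen($s));             // int(3): bytes ("e" is 1 byte, U+0301 is 2 bytes in UTF-8)
var_dump(mb_strlen($s, 'UTF-8')); // int(2): code points
var_dump(grapheme_strlen($s));    // int(1): grapheme clusters
```

All three answers are correct; they just count different units, which is exactly what the summaries fail to say.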

That's probably because ext/mbstring wasn't written with Unicode in mind: it was "developed to handle Japanese characters" back in 2001, and it still supports several pre-Unicode "multi-byte encodings". For a bit of nostalgia: http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php


So... if you want to help make people more aware of the grapheme_* functions, one place to start would be editing the documentation for the various string, mbstring, and grapheme functions so that they use consistent terminology and cross-reference each other more clearly. See http://doc.php.net/tutorial/ for how to get started.


Regards,

--
Rowan Tommins
[IMSoP]
