On 26/03/2024 21:14, Casper Langemeijer wrote:
> If you need someone to help for the grapheme_ marketing team, let me know.
I think a big part of the problem is that very few people dig into the
complexities of text encoding, and so don't know that a "grapheme" is
what they're looking for.
Unicode documentation is, generally, very careful with its terminology -
distinguishing between "code points", "code units", "graphemes",
"grapheme clusters", "glyphs", etc. Pretty much everyone else just says
"character", and assumes that everyone knows what they mean.
As a case in point, looking at the PHP manual pages for strlen,
mb_strlen, and grapheme_strlen:
Short summary:
- strlen — Get string length
- mb_strlen — Get string length
- grapheme_strlen — Get string length in grapheme units
Description:
- Returns the length of the given string.
- Gets the length of a string.
- Get string length in grapheme units (not bytes or characters)
The first two don't actually say what units they're measuring in. Maybe
it's millimetres? ;)
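As a rough sketch of the units involved, here is what each function reports for the same visible "é" written two ways. This assumes ext/mbstring and ext/intl are both loaded and that the strings are UTF-8; count_units() is just a made-up helper name for illustration, not something PHP provides.

```php
<?php
// Hypothetical helper: reports the length of $s in three different units.
// Assumes UTF-8 input, with ext/mbstring and ext/intl available.
function count_units(string $s): array
{
    return [
        'bytes'       => strlen($s),             // raw bytes
        'code_points' => mb_strlen($s, 'UTF-8'), // Unicode code points
        'graphemes'   => grapheme_strlen($s),    // grapheme clusters
    ];
}

$precomposed = "\u{00E9}";  // "é" as a single code point
$decomposed  = "e\u{0301}"; // "e" followed by a combining acute accent

foreach ([$precomposed, $decomposed] as $s) {
    $n = count_units($s);
    printf(
        "bytes: %d, code points: %d, graphemes: %d\n",
        $n['bytes'],
        $n['code_points'],
        $n['graphemes']
    );
}
// bytes: 2, code points: 1, graphemes: 1
// bytes: 3, code points: 2, graphemes: 1
```

Both strings look identical on screen, and only the grapheme count agrees with that.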
The last one uses the term "grapheme" without explaining what it means,
and makes a contrast with "characters", which is confusing, as one of
the definitions in the Unicode glossary
[https://unicode.org/glossary/#grapheme] is:
> What a user thinks of as a character.
The mb_strlen documentation has a bit more explanation in its Return
Values section:
> Returns the number of characters in string string having character
> encoding encoding. A multi-byte character is counted as 1.
For Unicode in particular, this is a poor description; it is completely
missing the term "code point", which is what it actually counts.
That's probably because ext/mbstring wasn't written with Unicode in
mind: it was "developed to handle Japanese characters" back in 2001,
and it still supports several pre-Unicode "multi-byte encodings".
For a bit of nostalgia:
http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php
So... if you want to help make people more aware of the grapheme_*
functions, one place to start would be editing the documentation for the
various string, mbstring, and grapheme functions so that they use
consistent terminology and sign-post each other more clearly.
http://doc.php.net/tutorial/
Regards,
--
Rowan Tommins
[IMSoP]