On 11 August 2024 16:50:52 BST, Nick Lockheart <li...@ageofdream.com> wrote:
>It seems that if everything on the Internet is multi-byte encoded now,
>then all of the PHP string functions should be multi-byte safe.

The phrase "multibyte safe" may have made sense about 30 years ago, when it was 
thought that a "universal character set" could just be a "wide ASCII", encoding 
a straightforward list of characters, just more of them. 

Modern Unicode is so much more than that, because the world's writing systems 
don't all work the same way. Should strlen() measure bytes, code points, or 
graphemes? Should strtoupper() accept a locale, so it can handle cases like 
Turkish "dotless i" where "I" is not the uppercase of "i"? And so on, and so on.
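
To make that concrete, here is a rough sketch of how the three answers differ 
for one short string. It assumes the mbstring and intl extensions are loaded, 
and the sample string is just one I picked for illustration.

```php
<?php
// Assumes the mbstring and intl extensions are loaded.

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{301}";

$bytes      = strlen($s);               // bytes in the UTF-8 encoding
$codePoints = mb_strlen($s, 'UTF-8');   // Unicode code points
$graphemes  = grapheme_strlen($s);      // user-perceived characters

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints: bytes=3 code points=2 graphemes=1

// Case mapping with no locale parameter: "i" becomes "I",
// whereas Turkish expects U+0130 (capital I with dot above).
$upper = mb_strtoupper('i', 'UTF-8');
var_dump($upper === 'I');               // bool(true)
```

None of the three counts is "the" right one; which one you want depends on 
whether you are allocating storage, iterating over code points, or showing 
text to a person.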

I've seen plenty of languages boast that they are "Unicode aware", but few 
actually engage with the question of what that means. Often they equate 
"character" with "code point" and stop there, which leads to results that are 
just as useless to most of the world as if they'd equated it with "byte".

Regards,
Rowan Tommins
[IMSoP]
