On 20/06/2019 23:30, Mark Randall wrote:
> There does at least seem to be the starting point in that mb_string is
> already widely used, and my suggestion that it "work as expected" is
> more that it would work as the equivalent mb_string / iconv function
> would.
I think this is a rather short-sighted way of looking at it. If people
want the API provided by the mbstring extension, they can just use those
functions; the advantage of designing a new set of functions is surely
that we don't need to stick to past decisions. If we start to build a
new standard library, as Zeev suggested in the deprecation thread, it is
a once-in-a-lifetime chance to build something better, not just copy
what's gone before.
> mb_strlen returns the number of codepoints for example, I'm not
> immediately seeing anything about mb_string supporting Graphemes as
> the only reference I could find to their manipulation was The intl
> extension.
The mbstring extension was not built for Unicode, but for older Japanese
multi-byte encodings, where the definition of "character" is much more
straightforward. Its Unicode support seems to mostly treat code points as
mappings for characters in some other encoding. (The oldest manual page
for it on archive.org [1] is from 2001, and includes the quaint remark
"As Unicode is getting popular, UTF-8 is used also.") The iconv library
is even more explicitly aimed at converting between character sets,
rather than understanding them (the extra functions such as iconv_strlen
are unique to PHP).
Unicode today is much more than a mapping of legacy encodings to a
universal character set, and I can think of no useful purpose in
declaring the "string length" of the British flag emoji to be 2, just
because it is encoded as the sequence U+1F1EC U+1F1E7.
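
As a rough illustration of that point (a sketch only, assuming the
mbstring and intl extensions are both loaded and the string is UTF-8;
the variable names are mine), the three notions of "length" give three
different answers for that flag:

```php
<?php
// U+1F1EC U+1F1E7: REGIONAL INDICATOR SYMBOL LETTER G, then LETTER B.
// Together they render as a single flag.
$flag = "\u{1F1EC}\u{1F1E7}";

$bytes      = strlen($flag);              // counts UTF-8 bytes
$codePoints = mb_strlen($flag, 'UTF-8');  // counts code points
$graphemes  = grapheme_strlen($flag);     // counts grapheme clusters (intl)

printf(
    "bytes=%d code points=%d graphemes=%d\n",
    $bytes,
    $codePoints,
    $graphemes
);
// prints: bytes=8 code points=2 graphemes=1
```

Only the last number matches what a user would call one "character".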
[1] http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php
Regards,
--
Rowan Collins
[IMSoP]
--
PHP Internals - PHP Runtime Development Mailing List