On 08/03/2020 14:08, Dan Ackroyd wrote:
> Related to this discussion, please could someone remind me why the
> mbstring extension is an extension and not part of core PHP?
>
> I realise at the time it was introduced, UTF-8 was far less widely
> used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg
>
> But now UTF-8 is pretty much the default for the vast majority of
> projects, so does that decision to keep it as an optional extension
> still hold up?


From what I can make out, mbstring was not actually built for Unicode string-handling, but for what we would now consider "legacy encodings". Its original niche seems to have been support for various Japanese text encodings, and UTF-8 support was added relatively late.

That has some implications for its design:

- every function takes encoding as a parameter, and defaults to a run-time global setting
- on the other hand, there is no support for locales in functions which would benefit, e.g. mb_convert_case, mb_stripos
- Unicode is treated as just another character encoding, so there is no support for concepts like normalisation, graphemes, character properties, etc
- instead, there are lots of niche functions for CJK languages like mb_convert_kana and mb_strwidth
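
As a rough illustration of the first and third points (a sketch only, assuming ext/mbstring is loaded and PHP 7.0+ for the \u{} escapes; the variable names are mine):

```php
<?php
$word = "caf\u{00E9}"; // "café" with a precomposed é (U+00E9), 5 bytes in UTF-8

// Encoding passed explicitly as the last parameter.
var_dump(mb_strlen($word, 'UTF-8'));      // int(4)
var_dump(mb_strlen($word, 'ISO-8859-1')); // int(5): every byte counts as a character

// With no encoding argument, the run-time global setting decides.
mb_internal_encoding('ISO-8859-1');
var_dump(mb_strlen($word)); // int(5)
mb_internal_encoding('UTF-8');
var_dump(mb_strlen($word)); // int(4)

// Same visible text in decomposed form: "e" followed by U+0301.
$decomposed = "cafe\u{0301}";
var_dump(mb_strlen($decomposed, 'UTF-8')); // int(5): code points, not graphemes
var_dump($word === $decomposed);           // bool(false): nothing normalises them
```

The same call gives different answers depending on a setting that may have been changed elsewhere in the request, and two strings a user would consider identical compare as different.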

It also includes some things which probably wouldn't pass review if proposed today:

- a lot of global state, with combined get-or-set functions like mb_detect_order(), mb_substitute_character(), etc
- mb_send_mail seems oddly specific, and has its own concept of "language" not shared by anything else
- there's an entire regex implementation, with its own API and some compatibility with the removed ereg_* functions; I believe the preg_* functions included in core already support UTF-8
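
To make the first and last points concrete, here is a small sketch (assuming ext/mbstring is loaded; preg_* needs nothing extra, and the variable names are just for illustration):

```php
<?php
// With no argument, the function reads the request-wide setting...
$previous = mb_substitute_character();

// ...and with an argument, the same function changes it for every later
// mb_* call in the request, including calls made by unrelated libraries.
mb_substitute_character('none');
var_dump(mb_substitute_character()); // string(4) "none"

// Callers have to remember to restore the old value themselves.
mb_substitute_character($previous);

// preg_* with the /u modifier treats pattern and subject as UTF-8.
$text = "\u{65E5}\u{672C}\u{8A9E}"; // three CJK characters, nine bytes
var_dump(preg_match_all('/./', $text));       // int(9): "." matches single bytes
var_dump(preg_match_all('/./u', $text));      // int(3): "." matches whole code points
var_dump(preg_match('/^\p{Han}+$/u', $text)); // int(1)
```

I have left out the mb_ereg_* functions on purpose, since they depend on how mbstring was built.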


For handling of Unicode, ext/intl is generally superior, with a more structured API based on Unicode-specific concepts, rather than attempting to map them to concepts used in older character encodings. There may be a need for a more user-friendly subset of this (a "UString" class is a common suggestion), but it shouldn't look like ext/mbstring, IMHO.
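
For comparison, a minimal sketch of the kind of Unicode-specific operations I mean, assuming ext/intl is loaded (the example strings are mine):

```php
<?php
$precomposed = "caf\u{00E9}";  // é as a single code point (U+00E9)
$decomposed  = "cafe\u{0301}"; // "e" followed by a combining acute accent

// Normalisation: bring both spellings to the same canonical form (NFC).
$normalised = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($normalised === $precomposed); // bool(true)

// Graphemes: count what a user would perceive as characters.
var_dump(grapheme_strlen($decomposed)); // int(4)

// Character properties: ask questions about individual code points.
var_dump(IntlChar::isalpha("\u{00E9}"));  // bool(true)
var_dump(IntlChar::charName("\u{0301}")); // string(22) "COMBINING ACUTE ACCENT"
```

None of these takes an encoding parameter; as far as I know the string-based functions in ext/intl simply expect UTF-8.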

I believe both extensions require fairly large external libraries, which probably justifies them being optional. From what I've read, ICU, which ext/intl is built on, would have been bundled with PHP 6, but its size and performance contributed to the failure of that project.


Regards,

--
Rowan Tommins (né Collins)
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
