Hi all,

We still have two bundled extensions for working with strings in different
encodings: ext/mbstring and ext/iconv.  While working on bug #79200[1],
I've noticed that the implementation of many of the iconv_*() functions
is rather suboptimal.  This is mostly because iconv() is meant just for
character encoding *conversion*.  ext/iconv puts several other useful
string functions on top of that, but cannot really optimize them,
because the extension doesn't know anything about the character
encodings involved.

For instance, iconv_strlen() is basically implemented by converting the
input string to UCS-4, and then simply counting the UCS-4 characters. On
the other hand, mb_strlen() makes use of length tables (where
appropriate), and as such does not even need to convert the string in
many typical cases.  Some quick benchmarks on getting the string length
of UTF-8 strings show that mb_strlen() is roughly 10 times faster than
iconv_strlen().  It would be trivial to improve the iconv_strlen()
implementation by converting a larger number of characters in one go
(currently it converts at most two at a time[2]), which would make the
function much faster (roughly 3 to 4 times faster with a 1024-character
buffer), but mb_strlen() would obviously still beat that.
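
To illustrate the length-table idea, here is a minimal sketch in C.  It
is not the actual mbstring code; the names utf8_len_by_high_nibble and
utf8_strlen_sketch are made up for this mail, and the sketch assumes
well-formed UTF-8 input (stray continuation bytes and invalid lead bytes
are not rejected, they are just stepped over).  The point is only that
the lead byte of each sequence tells us how far to skip, so nothing has
to be converted:

```c
#include <stddef.h>

/* Hypothetical table: byte length of a UTF-8 sequence, indexed by the
 * high nibble of its lead byte. Assumes well-formed input. */
static const unsigned char utf8_len_by_high_nibble[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, /* 0x0_ .. 0x7_: ASCII                    */
    1, 1, 1, 1,             /* 0x8_ .. 0xB_: stray continuation byte, */
                            /*               counted as one character */
    2, 2,                   /* 0xC_ .. 0xD_: two-byte sequence        */
    3,                      /* 0xE_:         three-byte sequence      */
    4                       /* 0xF_:         four-byte sequence       */
};

/* Count characters in a UTF-8 buffer of nbytes bytes by hopping from
 * lead byte to lead byte; no conversion to UCS-4 is needed. */
size_t utf8_strlen_sketch(const char *s, size_t nbytes)
{
    size_t count = 0;
    size_t i = 0;

    while (i < nbytes) {
        unsigned char lead = (unsigned char)s[i];
        i += utf8_len_by_high_nibble[lead >> 4];
        count++;
    }
    return count;
}
```

A conversion-based approach like the one in ext/iconv has to decode
every character before it can count it, whereas the loop above inspects
only one byte per character.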

The situation for the other iconv_*() functions is more or less
similar.  However, iconv() itself appears to be much faster than
mb_convert_encoding(); quick benchmarks show a factor of 2 to 3.

So I wonder if we wouldn't be better off if we unbundled ext/iconv, but
moved the iconv() function (and possibly the convert.iconv.* stream
filter) into ext/standard.  It shouldn't be hard to update code which
uses any of the iconv_*() functions to use the respective mb_*()
functions, and users who can't do this, or don't want to for whatever
reason, could still use the iconv package available from PECL.  However,
users who switch to mbstring would likely get better performance for
their applications.

For core developers, that would obviously save the time currently spent
maintaining both extensions.

For users learning PHP, and also for new code, it would be beneficial
not to have to decide which of these extensions to use: if they need
character encoding conversion, iconv() would be preferable; for more
general string functionality, it would be ext/mbstring.

Thoughts?

[1] <https://bugs.php.net/79200>
[2] <https://github.com/php/php-src/blob/php-7.4.3/ext/iconv/iconv.c#L714>

--
Christoph M. Becker
