Hello!
I think we are doing something wrong with the Unicode implementation: we are
now getting duplicate implementations of quite a few things (functions,
internal functions, hash implementations, two ways of storing identifiers...)
only because we need to support both IS_STRING and IS_UNICODE, plus
unicode=off mode.
I think I would prefer an IS_UNICODE/unicode=on only PHP.
This would mean that:
- no duplicated functionality for tons of functions, which would otherwise
make maintaining the thing very hard
- a cleaner (and a bit faster) Unicode implementation
- we have a bit less BC (backward compatibility).
Internally we would only see IS_UNICODE and IS_BINARY. We can have a small
layer around extensions that return IS_STRING, which automatically converts
their strings to and from Unicode. IS_STRING strings will still exist
internally, but should not be visible at the "user level".
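To make the idea of that small layer a bit more concrete, here is a rough
sketch in C. Everything in it is hypothetical: the str_value struct, the T_*
tags and str_to_user_level() are made-up names for illustration, not the real
zval or any existing engine API. It also assumes the extension returns Latin-1
bytes, so each byte maps directly to one UTF-16 code unit; a real layer would
have to run a proper converter for whatever encoding the extension uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type tags, standing in for IS_STRING / IS_UNICODE / IS_BINARY. */
typedef enum { T_STRING, T_UNICODE, T_BINARY } str_type;

/* Hypothetical string value, NOT the real zval layout. */
typedef struct {
    str_type type;
    size_t   len;   /* bytes for T_STRING/T_BINARY, UTF-16 code units for T_UNICODE */
    void    *data;  /* heap buffer owned by this value */
} str_value;

/*
 * Wrap a value coming back from a legacy extension. A T_STRING value is
 * converted in place to T_UNICODE; T_UNICODE and T_BINARY pass through
 * untouched. Returns 0 on success, -1 if allocation fails.
 *
 * Assumption: the legacy bytes are Latin-1, so byte value == code point.
 */
int str_to_user_level(str_value *v)
{
    assert(v != NULL);
    if (v->type != T_STRING) {
        return 0; /* nothing to convert */
    }

    /* Allocate at least one unit so a zero-length string still gets a buffer. */
    uint16_t *units = malloc((v->len ? v->len : 1) * sizeof *units);
    if (units == NULL) {
        return -1;
    }

    const unsigned char *bytes = v->data;
    for (size_t i = 0; i < v->len; i++) {
        units[i] = bytes[i]; /* Latin-1 byte widens to the same code point */
    }

    free(v->data);
    v->data = units;
    v->type = T_UNICODE;
    return 0;
}
```

With something like this, the engine would call the wrapper on every return
value of an unconverted extension, so user-level code only ever sees the
Unicode and binary types.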
For things like:
$str = unicode_convert($unicode, 'iso-2022');
where $unicode is IS_UNICODE, $str will now be an IS_BINARY string, with all
the restrictions that we already have on those strings (like no automatic
conversions).
The set of functions that work on binary strings can be quite limited (we
wouldn't need a strtolower for them, for example), so we cut down on a lot of
duplicated code. The same goes for not having to support both unicode=off and
unicode=on mode, as that complicates things too. This will limit functionality
on binary strings a bit, but I think that is 10 times better than an
unmaintainable PHP with Unicode support.
Besides this, I ran some micro benchmarks with a few functions on about 600
characters of text, comparing unicode=on and unicode=off mode. Results:
strrev (100,000 iterations over 600 characters of normalized Latin text):
  unicode off:  1.8 s
  unicode on:   5.0 s
strtoupper (100,000 iterations over the same text):
  unicode off:  2.2 s
  unicode on:   7.9 s
substr(50, 100) (1,000,000 iterations over the same text):
  unicode off:  3.9 s
  unicode on:  11.9 s
Unicode mode is roughly three times slower in every case (2.8x to 3.6x), which
I find unacceptable, and we need to figure out how to optimize this. For substr
the penalty is probably because we are using an iterator and not a direct
memcpy (because of surrogates); I am not so sure about the others.
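To illustrate the substr point, here is a minimal sketch in C of why
surrogates force a walk over the string, and of one possible way around it.
This is not the actual engine code: the u16_* function names are made up for
this example, and the "check for surrogates, then memcpy directly" fast path
is only an idea, not something I have measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* High (lead) surrogates are 0xD800..0xDBFF, low (trail) are 0xDC00..0xDFFF. */
int u16_is_high(uint16_t u) { return (u & 0xFC00) == 0xD800; }
int u16_is_low(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

/* Advance from code-unit index i past n code points; result is clamped to len. */
size_t u16_skip(const uint16_t *s, size_t len, size_t i, size_t n)
{
    while (n > 0 && i < len) {
        /* A valid surrogate pair is one code point stored in two code units. */
        if (u16_is_high(s[i]) && i + 1 < len && u16_is_low(s[i + 1])) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}

/*
 * Slow path: start and count are in code points, so we have to iterate to
 * find the code-unit offsets before copying. out needs room for len units.
 * Returns the number of code units written.
 */
size_t u16_substr(const uint16_t *s, size_t len, size_t start, size_t count,
                  uint16_t *out)
{
    assert(s != NULL && out != NULL);
    size_t from = u16_skip(s, len, 0, start);
    size_t to   = u16_skip(s, len, from, count);
    memcpy(out, s + from, (to - from) * sizeof(uint16_t));
    return to - from;
}

/* Returns 1 if the string contains any surrogate code unit, 0 otherwise. */
int u16_has_surrogates(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xF800) == 0xD800) {
            return 1;
        }
    }
    return 0;
}

/*
 * Fast path: only valid when the string has no surrogates, because then
 * code point index == code unit index and a direct memcpy is enough.
 */
size_t u16_substr_bmp(const uint16_t *s, size_t len, size_t start, size_t count,
                      uint16_t *out)
{
    assert(s != NULL && out != NULL);
    if (start > len) start = len;
    if (count > len - start) count = len - start;
    memcpy(out, s + start, count * sizeof(uint16_t));
    return count;
}
```

The scan in u16_has_surrogates() is itself a full pass over the string, so
this only pays off if the "no surrogates" answer can be cached with the string
(an assumption about how strings could be stored, not how they are stored
now).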
regards,
Derick
--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org
--
PHP Internals - PHP Runtime Development Mailing List