If Unicode were the solution, the PHP project was on the right track with
6.0. Sure, there remained work to do, but...

How long did it take to realize UTF-16 wasn't the end of the story? UCS-4
is the minimum needed to truly solve this, and we can all agree that the
western world will never accept 32 bits to store a single char, no way,
no how.
The UTF-8 solution is probably the right answer... you maintain 95% of
char * behavior, and you gain international character representation. The
only Unicode OS I can think of offhand is NT, and of course they hit the
UCS-4 problem early; they found this out 15+ years ago.
Sure, it doesn't appear as atomic, one fixed-width word per char, but the
existing library frameworks contain most of the string processing that is
required. There is no 16-bit network transmission API that I can think
of; you are still devolving to UTF-8 for client results.
Moving forward with accepting -and preferring- UTF-8 as the
representation of characters throughout PHP, recognizing UTF-8 for
char-length representations, and so forth, would do wonders. And 8-bit
octet data can be set aside in the same data structures. It is the
straightforward answer, which is probably why Linux did not repeat
Windows NT's decision and adopted UTF-8 instead.
Hi,
UTF-8 is good for text that contains mostly ASCII chars and the
occasional international character. It's also generally OK for storing
strings and passing them between apps.
However, it's a really poor representation of a string in memory, as a
code point can vary between 1 and 4 bytes. A simple operation like
$string[$x] means you need to walk and interpret the string from the
start until you have counted up to the codepoint you need.
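To make that concrete, here is a minimal sketch in C (PHP's implementation language) of the O(n) walk; `utf8_offset_of_codepoint` is an invented helper for illustration, not anything from the PHP source:

```c
#include <stddef.h>

/* Find the byte offset of the n-th codepoint in a UTF-8 string.
 * Continuation bytes have the form 10xxxxxx, so any byte that does
 * NOT match that pattern starts a new codepoint. There is no way to
 * jump straight to codepoint n: we must scan from the beginning. */
static size_t utf8_offset_of_codepoint(const char *s, size_t n)
{
    size_t byte = 0, cp = 0;
    while (s[byte] != '\0') {
        if (((unsigned char)s[byte] & 0xC0) != 0x80) { /* start byte */
            if (cp == n)
                return byte;
            cp++;
        }
        byte++;
    }
    return byte; /* n is past the end of the string */
}
```

So $string[$x] on a UTF-8 string costs time proportional to $x, where a fixed-width encoding makes it a single load.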
UTF-8 also takes 4 bytes to represent characters in the higher planes, as
quite a lot of bits in every char are spent describing how long the code
point is, where it ends, and so on. Memory-wise, this means it may not be
of much benefit to Asian countries.
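The length rules from RFC 3629 make that overhead concrete; this is just a sketch of the byte count per codepoint range, not code from PHP:

```c
/* Bytes needed to encode one Unicode codepoint in UTF-8 (RFC 3629).
 * The comments show how many payload bits are left after the length
 * and continuation markers claim their share of each byte. */
static int utf8_encoded_length(unsigned int cp)
{
    if (cp < 0x80)    return 1; /* 0xxxxxxx          ->  7 bits */
    if (cp < 0x800)   return 2; /* 110xxxxx 10xxxxxx -> 11 bits */
    if (cp < 0x10000) return 3; /* three bytes       -> 16 bits */
    return 4;                   /* four bytes        -> 21 bits */
}
```

Note that even a BMP CJK character such as U+6F22 already costs 3 bytes here versus 2 in UTF-16, which is the memory point above.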
Since the western world, as you put it, wouldn't want to waste 4 bytes on
characters they use that fit in 1 byte, we could opt to store the
encoding of each string as a byte enumerating all the encodings PHP
supports (I believe there are fewer than 255..), so the string functions
know how to operate on them and convert between them.
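A rough sketch of what such a tagged string could look like in C; the names here (`str_encoding`, `php_string`) are invented for illustration, not taken from the engine:

```c
#include <stddef.h>

/* One byte is enough to enumerate every encoding PHP supports. */
enum str_encoding {
    ENC_LATIN1 = 0,
    ENC_UTF8   = 1,
    ENC_UCS4   = 2
    /* ... one entry per supported encoding, fewer than 255 total */
};

/* Each string carries its own encoding tag, so string functions can
 * pick the right code path and convert only when encodings differ. */
struct php_string {
    unsigned char encoding; /* a value from enum str_encoding */
    size_t        len;      /* length in bytes, not characters */
    const char   *data;     /* raw bytes in that encoding */
};
```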
This means you can use Unicode only when you need it, which reduces the
impact of using a full 4 bytes per code point: you can still use 1-byte
Latin-1 encoding, freely mix it with Unicode, and still produce UTF-8
output in the end for the web (the final conversion to UTF-8 from
*anything* is cheap).
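Why that final conversion is cheap: every Latin-1 byte is already the Unicode codepoint U+0000..U+00FF, so one pass with no lookahead suffices. A minimal sketch (invented helper name, not PHP's actual converter):

```c
#include <stddef.h>

/* Convert Latin-1 to UTF-8 in one pass. Bytes below 0x80 copy through
 * unchanged; bytes 0x80..0xFF expand to exactly two UTF-8 bytes.
 * `out` must have room for 2 * len + 1 bytes. Returns bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = (char)in[i];
        } else {
            out[o++] = (char)(0xC0 | (in[i] >> 6));   /* top 2 bits */
            out[o++] = (char)(0x80 | (in[i] & 0x3F)); /* low 6 bits */
        }
    }
    out[o] = '\0';
    return o;
}
```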
Another alternative is to do what JavaScript does. JavaScript uses a
2-byte encoding for Unicode, and when a code point needs more than 2
bytes, it's encoded in 4 bytes as a surrogate pair. JavaScript counts
that codepoint as 2 chars, although it's technically one codepoint. It's
awkward, but since PHP is a web language, consistency with JavaScript may
even be beneficial. It also solves the $string[$x] problem, as you no
longer need to walk the string: you just blindly read the 2 bytes at
address (string pointer + 2 * $x).
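A sketch of that 2-byte model in C; the helper names are invented here, and the surrogate arithmetic follows the UTF-16 specification:

```c
#include <stdint.h>
#include <stddef.h>

/* O(1) indexing by 2-byte code unit -- the JavaScript-style model:
 * one multiply and one load, no walking from the start. */
static uint16_t utf16_unit_at(const uint16_t *s, size_t x)
{
    return s[x];
}

/* Units in 0xD800..0xDBFF are the first half of a surrogate pair,
 * i.e. a codepoint outside the BMP that counts as "2 chars" here. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Combine a surrogate pair back into the real codepoint. */
static uint32_t surrogate_pair_to_codepoint(uint16_t hi, uint16_t lo)
{
    return 0x10000 + ((((uint32_t)hi - 0xD800) << 10) | (lo - 0xDC00));
}
```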
With this approach, all characters in the BMP will report correct offsets
with char-index and substr functions, as they fit in 2 bytes. Workarounds
and helper functions can be introduced to handle the 4-byte codepoints in
the other planes.
It of course makes certain operations harder; for example, a character
range between two 4-byte codepoints in a regex will produce unexpected
results, because the regex engine will see these chars:
[2bytes2bytes-2bytes2bytes] i.e.: [a b-c d]
and not this:
[4bytes-4bytes]
Still, a variable-width encoding, UTF-8 or UTF-16, doesn't cut it for
general use for me, as in tests it shows a drastic slowdown when the
script needs to do heavy string processing. I'd rather have Unicode
strings take more RAM while staying fast, and use Latin-1 when what I
need is Latin-1.
Regards,
Stan Vassilev
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php