If Unicode were the solution, the PHP project was on the right page with 6.0. Sure, there remained work to do, but...

How long did it take to realize UTF-16 wasn't the end of the story? UCS-4 is the minimum that actually solves this, and we all agree that nobody in the western world is going to spend 32 bits storing a single char, no way, no how.
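To make that concrete, here's a sketch (standard surrogate math, nothing PHP-specific): any code point above U+FFFF costs two 16-bit units, so UTF-16 is just as variable-width as UTF-8.

#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit
   units written: 1 inside the BMP, 2 (a surrogate pair) above it,
   which is exactly why UTF-16 is not "one word per char". */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

utf16_encode(0x1F600, out) writes 0xD83D 0xDE00: two units for one char.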

The UTF-8 solution is probably the right answer... you maintain 95% of char * behavior, and you gain international character representation. The only Unicode OS I can think of offhand is NT, and of course they hit the UCS-4 problem early: their 16-bit UCS-2 strings couldn't cover everything and had to become variable-width UTF-16. They found this out 15+ years ago.
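The char * compatibility is easy to demonstrate; a small sketch (the file name is made up): every byte of a multi-byte UTF-8 sequence has the high bit set, so ASCII bytes like '.' or '\0' never occur inside one, and the classic byte-oriented routines keep working.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "naïve.txt" -- the ï is the two bytes C3 AF. Neither byte
       can be mistaken for ASCII, so NUL termination, strlen (in
       bytes) and strchr on ASCII delimiters behave as always. */
    const char *s = "na\xC3\xAFve.txt";
    printf("%zu bytes\n", strlen(s));    /* 10: counts bytes   */
    printf("ext: %s\n", strchr(s, '.')); /* finds the real dot */
    return 0;
}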

Sure it doesn't appear as atomic, one fixed-size word per char, but the existing library frameworks contain most of the string processing that is required. And there is no 16-bit network transmission API that I can think of; you are still devolving to UTF-8 for client results.
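That final step is a single linear pass; a sketch of it (assumes well-formed input -- real code would validate the surrogates):

#include <stddef.h>
#include <stdint.h>

/* The output boundary: turn UTF-16 units back into UTF-8 bytes
   for the client. Returns the number of bytes written. */
static size_t utf16_to_utf8(const uint16_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n)  /* pair */
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + (in[++i] - 0xDC00);
        if (cp < 0x80) {
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[o++] = 0xC0 | (cp >> 6);
            out[o++] = 0x80 | (cp & 0x3F);
        } else if (cp < 0x10000) {
            out[o++] = 0xE0 | (cp >> 12);
            out[o++] = 0x80 | ((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (cp & 0x3F);
        } else {
            out[o++] = 0xF0 | (cp >> 18);
            out[o++] = 0x80 | ((cp >> 12) & 0x3F);
            out[o++] = 0x80 | ((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (cp & 0x3F);
        }
    }
    return o;
}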

Accepting, and preferring, UTF-8 as the representation of characters throughout PHP, recognizing UTF-8 in char-length operations, and so forth, would do wonders for moving forward. And 8-bit octet data can be kept in the same data structures. It is the straightforward answer, which is probably why Linux did not repeat Windows NT's decision and adopted UTF-8 instead.


Hi,

UTF-8 is good for text that contains mostly ASCII chars and the occasional international Unicode char. It's also generally OK for storing strings and for passing them between apps.

However, it's a really poor representation of a string in memory, as a code point can vary between 1 and 4 bytes. A simple operation like $string[$x] means you need to walk and interpret the string from the start, counting code points until you reach the one you need.
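In other words, $string[$x] degenerates into something like this (a hypothetical helper, not actual engine code):

#include <stddef.h>

/* Byte offset of code point number idx in a UTF-8 string.
   Continuation bytes look like 10xxxxxx, so only bytes that
   start a sequence are counted. Every call walks from the
   beginning: O(n) for what used to be O(1). */
static size_t utf8_offset(const unsigned char *s, size_t idx)
{
    size_t off = 0;
    while (idx > 0 && s[off] != 0) {
        off++;
        if ((s[off] & 0xC0) != 0x80) /* not a continuation byte */
            idx--;
    }
    return off;
}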

UTF-8 also takes 4 bytes to represent characters in the supplementary planes, as quite a lot of bits in every char are spent describing how long the sequence is and where it ends; even BMP CJK characters take 3 bytes where UTF-16 takes 2. This means memory-wise it may not be of big benefit to Asian countries.
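For reference, this is the framing being paid for (the standard UTF-8 layouts; x marks payload bits):

/* 0xxxxxxx                               7 payload bits of  8
   110xxxxx 10xxxxxx                     11 payload bits of 16
   1110xxxx 10xxxxxx 10xxxxxx            16 payload bits of 24
   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21 payload bits of 32 */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;  /* assumes well-formed input */
    if (lead < 0xF0) return 3;
    return 4;
}

So a supplementary-plane char spends 11 of its 32 bits on framing.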

Since the western world, as you put it, wouldn't want to waste 4 bytes on characters that fit in 1 byte, we could opt to store the encoding of a string as a byte enumerating all the encodings PHP supports (I believe there are fewer than 255...), so the string functions know how to operate on them and convert between them.
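Something like this, as a sketch (all names are mine, not a proposal for the actual internal layout):

#include <stddef.h>
#include <stdint.h>

/* One byte identifies the encoding, so string functions know
   how to interpret the payload and convert between encodings.
   Far fewer than 255 values are needed. */
typedef enum {
    ENC_BINARY = 0,   /* raw octets, no text semantics */
    ENC_LATIN1 = 1,
    ENC_UTF8   = 2,
    ENC_UTF16  = 3
    /* ... one entry per encoding PHP supports */
} str_encoding;

typedef struct {
    uint8_t enc;      /* a str_encoding value       */
    size_t  len;      /* payload length in bytes    */
    char    val[1];   /* payload allocated in place */
} tagged_str;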

This means you can use Unicode only when you need it, which reduces the impact of spending a full 4 bytes per code point: you can still use the 1-byte Latin-1 encoding, freely mix it with Unicode, and still produce UTF-8 output in the end for the web (the final conversion to UTF-8 from *anything* is cheap).
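The "cheap" part is easy to see for Latin-1, whose 256 values map directly onto U+0000..U+00FF; a sketch (assumes out can hold up to 2*n bytes):

#include <stddef.h>
#include <stdint.h>

/* Latin-1 to UTF-8: bytes below 0x80 pass through unchanged,
   the rest become exactly two bytes. One linear pass, no
   tables -- converting at output time costs almost nothing. */
static size_t latin1_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = 0xC0 | (in[i] >> 6);
            out[o++] = 0x80 | (in[i] & 0x3F);
        }
    }
    return o;
}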

Another alternative is doing what JavaScript does. JavaScript uses a 2-byte encoding for Unicode, and when a code point needs more than 2 bytes, it's encoded in 4. JavaScript counts that code point as 2 chars, although it's technically one code point. It's awkward, but since PHP is a web language, consistency with JavaScript may even be beneficial. It also solves the $string[$x] problem: you no longer need to walk the string, you just blindly read the 2 bytes at the string's base address + 2 * $x.
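That lookup, as a sketch, really is just an offset:

#include <stddef.h>
#include <stdint.h>

/* O(1) indexing over UTF-16 code units, JavaScript-style. For
   a code point outside the BMP this hands back one half of a
   surrogate pair -- the "counts as 2 chars" behavior above. */
static uint16_t str_index(const uint16_t *s, size_t x)
{
    return s[x];
}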

With this approach, all characters in the BMP report correct offsets to char-index and substr functions, as they fit in 2 bytes. Workarounds and helper functions can be introduced for handling the 4-byte code points of the other planes.
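One such helper could look like this (the name is made up):

#include <stddef.h>
#include <stdint.h>

/* Read the whole code point at code-unit index x, joining a
   surrogate pair when x lands on a high surrogate. *units is
   set to the number of code units consumed, 1 or 2. */
static uint32_t codepoint_at(const uint16_t *s, size_t x, int *units)
{
    uint32_t u = s[x];
    if (u >= 0xD800 && u <= 0xDBFF) {  /* high surrogate */
        *units = 2;
        return 0x10000 + ((u - 0xD800) << 10) + (s[x + 1] - 0xDC00);
    }
    *units = 1;
    return u;
}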

It of course makes certain operations harder. For example, a character range between two 4-byte code points in a regex will produce unexpected results, because the regex engine sees these chars:

[2bytes2bytes-2bytes2bytes] i.e.:   [a b-c d]

and not this:

[4bytes-4bytes]

Still, a variable-width encoding, UTF-8 or UTF-16, doesn't cut it for general use for me: in tests it shows a drastic slowdown when the script needs to do heavy string processing. I'd rather have Unicode strings take more RAM while being fast, and use Latin-1 when what I need is Latin-1.

Regards,
Stan Vassilev
