If Unicode were the solution, the PHP project was on the right page with 6.0. Sure, there remained work to do, but...

How long did it take to realize UTF-16 wasn't the end of the story? UCS-4 is the minimum that actually solves this, and we all agree that nobody in the western world is going to spend 32 bits storing a single char, no way, no how.
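To make that concrete, here's a sketch (standard surrogate math, nothing PHP-specific): any code point above U+FFFF costs two 16-bit units, so UTF-16 is just as variable-width as UTF-8.

#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit
   units written: 1 inside the BMP, 2 (a surrogate pair) above it,
   which is exactly why UTF-16 is not "one word per char". */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

utf16_encode(0x1F600, out) writes 0xD83D 0xDE00: two units for one char.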

The UTF-8 solution is probably the right answer... you maintain 95% of char * behavior, and you gain international character representation. The only Unicode OS I can think of offhand is NT, and of course they hit the UCS-4 problem early: their 16-bit UCS-2 strings couldn't cover everything and had to become variable-width UTF-16. They found this out 15+ years ago.
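The char * compatibility is easy to demonstrate; a small sketch (the file name is made up): every byte of a multi-byte UTF-8 sequence has the high bit set, so ASCII bytes like '.' or '\0' never occur inside one, and the classic byte-oriented routines keep working.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "naïve.txt" -- the ï is the two bytes C3 AF. Neither byte
       can be mistaken for ASCII, so NUL termination, strlen (in
       bytes) and strchr on ASCII delimiters behave as always. */
    const char *s = "na\xC3\xAFve.txt";
    printf("%zu bytes\n", strlen(s));    /* 10: counts bytes   */
    printf("ext: %s\n", strchr(s, '.')); /* finds the real dot */
    return 0;
}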

Sure it doesn't appear as atomic, one fixed-size word per char, but the existing library frameworks contain most of the string processing that is required. And there is no 16-bit network transmission API that I can think of; you are still devolving to UTF-8 for client results.
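That final step is a single linear pass; a sketch of it (assumes well-formed input -- real code would validate the surrogates):

#include <stddef.h>
#include <stdint.h>

/* The output boundary: turn UTF-16 units back into UTF-8 bytes
   for the client. Returns the number of bytes written. */
static size_t utf16_to_utf8(const uint16_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n)  /* pair */
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + (in[++i] - 0xDC00);
        if (cp < 0x80) {
            out[o++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[o++] = 0xC0 | (cp >> 6);
            out[o++] = 0x80 | (cp & 0x3F);
        } else if (cp < 0x10000) {
            out[o++] = 0xE0 | (cp >> 12);
            out[o++] = 0x80 | ((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (cp & 0x3F);
        } else {
            out[o++] = 0xF0 | (cp >> 18);
            out[o++] = 0x80 | ((cp >> 12) & 0x3F);
            out[o++] = 0x80 | ((cp >> 6) & 0x3F);
            out[o++] = 0x80 | (cp & 0x3F);
        }
    }
    return o;
}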

Accepting, and preferring, UTF-8 as the representation of characters throughout PHP, recognizing UTF-8 in char-length operations, and so forth, would do wonders for moving forward. And 8-bit octet data can be kept in the same data structures. It is the straightforward answer, which is probably why Linux did not repeat Windows NT's decision and adopted UTF-8 instead.


Hi,

UTF-8 is good for text that contains mostly ASCII chars and the occasional international Unicode char. It's also generally OK for storing strings and for passing them between apps.

However, it's a really poor representation of a string in memory, as a code point can vary between 1 and 4 bytes. A simple operation like $string[$x] means you need to walk and interpret the string from the start, counting code points until you reach the one you need.
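In other words, $string[$x] degenerates into something like this (a hypothetical helper, not actual engine code):

#include <stddef.h>

/* Byte offset of code point number idx in a UTF-8 string.
   Continuation bytes look like 10xxxxxx, so only bytes that
   start a sequence are counted. Every call walks from the
   beginning: O(n) for what used to be O(1). */
static size_t utf8_offset(const unsigned char *s, size_t idx)
{
    size_t off = 0;
    while (idx > 0 && s[off] != 0) {
        off++;
        if ((s[off] & 0xC0) != 0x80) /* not a continuation byte */
            idx--;
    }
    return off;
}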

UTF-8 also takes 4 bytes to represent characters in the supplementary planes, as quite a lot of bits in every char are spent describing how long the sequence is and where it ends; even BMP CJK characters take 3 bytes where UTF-16 takes 2. This means memory-wise it may not be of big benefit to Asian countries.
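For reference, this is the framing being paid for (the standard UTF-8 layouts; x marks payload bits):

/* 0xxxxxxx                               7 payload bits of  8
   110xxxxx 10xxxxxx                     11 payload bits of 16
   1110xxxx 10xxxxxx 10xxxxxx            16 payload bits of 24
   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21 payload bits of 32 */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;  /* assumes well-formed input */
    if (lead < 0xF0) return 3;
    return 4;
}

So a supplementary-plane char spends 11 of its 32 bits on framing.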

Since the western world, as you put it, wouldn't want to waste 4 bytes on characters that fit in 1 byte, we could opt to store the encoding of a string as a byte enumerating all the encodings PHP supports (I believe there are fewer than 255...), so the string functions know how to operate on them and convert between them.
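Something like this, as a sketch (all names are mine, not a proposal for the actual internal layout):

#include <stddef.h>
#include <stdint.h>

/* One byte identifies the encoding, so string functions know
   how to interpret the payload and convert between encodings.
   Far fewer than 255 values are needed. */
typedef enum {
    ENC_BINARY = 0,   /* raw octets, no text semantics */
    ENC_LATIN1 = 1,
    ENC_UTF8   = 2,
    ENC_UTF16  = 3
    /* ... one entry per encoding PHP supports */
} str_encoding;

typedef struct {
    uint8_t enc;      /* a str_encoding value       */
    size_t  len;      /* payload length in bytes    */
    char    val[1];   /* payload allocated in place */
} tagged_str;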

This means you can use Unicode only when you need it, which reduces the impact of spending a full 4 bytes per code point: you can still use the 1-byte Latin-1 encoding, freely mix it with Unicode, and still produce UTF-8 output in the end for the web (the final conversion to UTF-8 from *anything* is cheap).
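The "cheap" part is easy to see for Latin-1, whose 256 values map directly onto U+0000..U+00FF; a sketch (assumes out can hold up to 2*n bytes):

#include <stddef.h>
#include <stdint.h>

/* Latin-1 to UTF-8: bytes below 0x80 pass through unchanged,
   the rest become exactly two bytes. One linear pass, no
   tables -- converting at output time costs almost nothing. */
static size_t latin1_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = 0xC0 | (in[i] >> 6);
            out[o++] = 0x80 | (in[i] & 0x3F);
        }
    }
    return o;
}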

Another alternative is doing what JavaScript does. JavaScript uses a 2-byte encoding for Unicode, and when a code point needs more than 2 bytes, it's encoded in 4. JavaScript counts that code point as 2 chars, although it's technically one code point. It's awkward, but since PHP is a web language, consistency with JavaScript may even be beneficial. It also solves the $string[$x] problem: you no longer need to walk the string, you just blindly read the 2 bytes at the string's base address + 2 * $x.
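That lookup, as a sketch, really is just an offset:

#include <stddef.h>
#include <stdint.h>

/* O(1) indexing over UTF-16 code units, JavaScript-style. For
   a code point outside the BMP this hands back one half of a
   surrogate pair -- the "counts as 2 chars" behavior above. */
static uint16_t str_index(const uint16_t *s, size_t x)
{
    return s[x];
}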

With this approach, all characters in the BMP report correct offsets to char-index and substr functions, as they fit in 2 bytes. Workarounds and helper functions can be introduced for handling the 4-byte code points of the other planes.
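One such helper could look like this (the name is made up):

#include <stddef.h>
#include <stdint.h>

/* Read the whole code point at code-unit index x, joining a
   surrogate pair when x lands on a high surrogate. *units is
   set to the number of code units consumed, 1 or 2. */
static uint32_t codepoint_at(const uint16_t *s, size_t x, int *units)
{
    uint32_t u = s[x];
    if (u >= 0xD800 && u <= 0xDBFF) {  /* high surrogate */
        *units = 2;
        return 0x10000 + ((u - 0xD800) << 10) + (s[x + 1] - 0xDC00);
    }
    *units = 1;
    return u;
}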

It of course makes certain operations harder. For example, a character range between two 4-byte code points in a regex will produce unexpected results, because the regex engine sees these chars:

[2bytes2bytes-2bytes2bytes] i.e.:   [a b-c d]

and not this:

[4bytes-4bytes]

Still, a variable-width encoding, UTF-8 or UTF-16, doesn't cut it for general use for me: in tests it shows a drastic slowdown when the script needs to do heavy string processing. I'd rather have Unicode strings take more RAM while being fast, and use Latin-1 when what I need is Latin-1.

Regards,
Stan Vassilev
