Although I never contributed to the code of the PHP project, I hope the ideas that follow may provide some suggestions for the future development of the PHP language. The basic idea is described in the abstract, and those who are not interested may stop there :-)

Abstract
========

My modest proposal is to provide Unicode support to PHP without rewriting
the whole engine and its libraries, by introducing the UString abstraction
layer as a regular class, with minimal support from the core engine.
Basically, the UString class hides the internal implementation of Unicode
strings and makes it possible to experiment with several solutions (UTF-8,
UCS-2, UCS-4, ...). Several different implementations may be attempted and
made available, leaving the user free to choose the best compromise between
performance and memory footprint.

Two useful functions, u() and uecho(), are also discussed; they may help in
writing Unicode-interoperable source programs and libraries. These two
functions require some support from the PHP engine.


The UString class
=================

1. The UString class holds an immutable array of Unicode characters. It is
   final and hides the internal representation of the Unicode string, which
   may be UTF-8, UCS-2 or anything else:

       final class UString
           implements Hashable, Comparable, Sortable, UPrintable,
                      Printable, Serializable
       {
           ...
       }

   (Besides the well-known Serializable interface, the other implemented
   interfaces are discussed below.)

   All the PHP programs and external libraries based on this class are
   completely unaware of the actual internal encoding used. Several
   implementations may be provided to fit different needs. In Western
   European countries UTF-8 works well, and some optimizations give
   performance very close to that of an ordinary "string" of bytes. For
   example, in my PHP implementation an internal byte index makes it
   possible to scan the UTF-8 sequences quickly forward and backward.

2. The UString class has no constructor, but several factory methods that
   take arrays of bytes (aka string):

       static UString function fromASCII(string $s)
       static UString function fromUTF8(string $s)
       static UString function fromISO88591(string $s)
       static UString function fromUTF16LE(string $s)
       static UString function fromUTF16BE(string $s)
       ...

   These factory methods silently skip invalid bytes and invalid sequences,
   possibly replacing them with '?'. No warnings, no exceptions. Other
   utility functions may be provided that check an array of bytes against a
   specific encoding.

   Several corresponding instance methods perform the reverse translation
   into an array of bytes:

       string function toASCII()
       string function toUTF8()
       string function toISO88591()
       string function toUTF16LE()
       string function toUTF16BE()
       ...

   So, for example,

       UString::fromUTF8( $u->toUTF8() )->equals($u)

   is always TRUE for any Unicode string $u (the equals() method is
   described below).

3. The UString class provides the usual string manipulation routines:

       int function length()
       UString function substring($from, $to)
       UString function charAt($index)
       UString function append(UString $u)
       bool function startsWith(UString $u)
       bool function endsWith(UString $u)
       int function indexOf(UString $u)
       UString function trim($blacklist = u("\n\r\t"))
       bool function equalsIgnoreCase(UString $u)
       UString function toUpperCase()
       UString function toLowerCase()
       UString[] function explode($separator = u(" "))
       UString function implode($separator = u(""))
       ...

   (More about the magic u() function later.)

4. The UString class implements the UPrintable interface, which returns
   "the best human-readable representation of the object as a UString
   string", that is, the string itself:

       UString function __toUString(){ return $this; }

5. The UString class implements the Printable interface, which returns
   "the best human-readable representation of the object as a string,
   possibly composed of ASCII characters only":

       string function __toString(){ return $this->toASCII(); }

6. The UString class implements the Hashable interface, useful for
   implementing hashing algorithms (HashMap, HashSet, ...):

       int function getHash(){ ... }

   Since UString is immutable, this function may compute the hash once and
   for all. In my current PHP implementation I have used crc32(), but the
   PHP engine hides a more efficient hashing function that might be used
   instead (what about making it available to userland code?).

7. The UString class implements the Comparable interface:

       bool function equals(object $u)

   which returns TRUE only if the object is a UString and contains the same
   sequence of Unicode characters, and returns FALSE in any other case.

8. The UString class implements the Sortable interface:

       int function compareTo(object $u)

   which returns -1, 0 or +1 if $u is a UString, or raises E_WARNING if $u
   is not a UString.


The u() and the uecho() magic functions
=======================================

Two "magic" functions help in writing PHP programs. Basically, u() is a
factory function that translates an array of bytes into a UString, for
example:

    UString function u(string $s){ return UString::fromXxx($s); }

    $hello = u("hello");

where Xxx is the encoding of the source. But the u() function may do much
more than this, and the implementation I have made provides several other
features:

- Literal strings are cached, so if the u("hello") statement is executed
  several times, a single UString object is created once and for all, and
  this same object is returned each time u("hello") is evaluated. Since the
  number of literal strings in a source program is finite, the string cache
  is finite as well.

- It automatically converts any type of data into a UString, so that u(123)
  yields the same as UString::fromASCII( (string) 123 ).
  Here too, small numbers, the most common ones, can be cached.

- If the argument is an object that implements the UPrintable interface,
  its __toUString() method is called; if it implements Printable, its
  __toString() method is called. Boolean values generate "FALSE" or "TRUE".
  The NULL value generates "NULL". And so on.

- If the argument is a UString, it is returned as is.

- If several arguments are provided, each argument is converted into a
  UString and the results are concatenated with UString::append().

All this can be implemented in PHP source and does not require changes to
the engine.

The uecho() function does just the same, but also sends the result to
stdout using the chosen encoding:

    uecho(...)  ====>  echo u(...)->toXxx();

where Xxx is the encoding corresponding to the one used by the u() function
to translate literal strings.

Programmers may then write something like this:

    function TenHelloWorld()
    {
        $hello = u("hello");
        $world = u("world!");
        for($i = 0; $i < 10; $i++)
            uecho($i, " - ", $hello, ", ", $world);
    }

This function generates 2 objects for the $hello and $world vars (cached),
2 objects for the " - " and ", " literal strings (cached), 10 objects for
the $i numbers (possibly cached), and another 10 objects for the resulting
concatenations. If this function is called again, the cached values are
reused.


Support from the PHP engine
===========================

R1. First of all, the current implementation of the UString class, being
bare PHP code, isn't very efficient, and a C implementation of all, or at
least of the most critical sections of the code, could greatly improve
performance. Since the PHP code developed around the UString class does not
depend on the internal representation of the Unicode characters, several
implementations may be tested and the final decision about the "standard"
one can be postponed. Or the choice can be left to the users, who may pick
the implementation that best fits their needs.

R2. Second, the u() function requires some support from the PHP engine
because it cannot be used in static expressions. The following code, for
example, is not valid:

    function f( UString $s = u("xxx") ){ ... }

    class MyClass {
        const C = u("xxx"),
              A = array( u("zero"), u("one"), u("two") );
    }

If the PHP restrictions on static expressions could be relaxed a bit, at
least for some magic functions, the code above would become possible.

R3. Third, only the engine can establish whether a string that enters the
u() function is really a literal string and not a dynamically generated
one. For example, in my current PHP implementation I can only warn the
programmer, in the documentation, against doing things like

    for($i = 0; $i < 10000; $i++)
        uecho( "cycle no. $i" );

which would pollute the cache of u() with thousands of useless dynamically
generated strings. Or, even better, the PHP engine itself might split the
string and rewrite it automatically in a cache-aware way as:

    uecho("cycle no. ", $i);

R4. Another area where some support from the PHP engine would be useful is
the detection of the encoding used in the source, so that the Xxx encoding
to be used in the u() and uecho() functions can be determined
automatically. In my current PHP implementation I stick with UTF-8, but a
more general approach may take advantage of the new
declare(encoding="Xxx") statement.
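
To see why u() has to know the encoding of the source file, here is a small sketch in plain PHP. The function literal_bytes_to_utf8() is a hypothetical helper written only for this illustration; it is not part of the proposal or of the author's library, it handles just two encodings, and it assumes that UTF-8 input is already valid.

```php
<?php
// Hypothetical helper (illustration only): convert the raw bytes of a string
// literal to UTF-8, given the encoding the source file was saved in.
function literal_bytes_to_utf8($bytes, $sourceEncoding)
{
    switch ($sourceEncoding) {
        case 'UTF-8':
            // Already UTF-8; this sketch assumes the bytes are valid.
            return $bytes;

        case 'ISO-8859-1':
            // Each Latin-1 byte is the code point with the same value;
            // values >= 0x80 need two bytes in UTF-8.
            $out = '';
            for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
                $b = ord($bytes[$i]);
                if ($b < 0x80) {
                    $out .= chr($b);
                } else {
                    $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
                }
            }
            return $out;

        default:
            throw new InvalidArgumentException(
                "unsupported encoding: $sourceEncoding");
    }
}

// The word "café": the last letter is one byte in ISO-8859-1, two in UTF-8.
$latin1 = "caf\xE9";
$utf8   = "caf\xC3\xA9";

var_dump(literal_bytes_to_utf8($latin1, 'ISO-8859-1') === $utf8); // bool(true)
var_dump(literal_bytes_to_utf8($utf8, 'UTF-8') === $utf8);        // bool(true)

// Same bytes, wrong assumption about the source encoding: different text.
var_dump(literal_bytes_to_utf8($utf8, 'ISO-8859-1') === $utf8);   // bool(false)
```

Decoding the same literal under the wrong source encoding silently produces different text, so the choice of Xxx is better made by something that actually knows how the file was saved.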
For example, the engine might instantiate a "translator" object to be used
for the current source file, and this translator object might be made
available to the program as a global variable that translates from arrays
of bytes to UString and vice versa:

    interface EncodingTranslator {

        # Encoder call-back:
        UString function encode(string $s);

        # Decoder call-back:
        string function decode(UString $u);
    }

    # UTF-8 specific encoder/decoder function pair:
    class UTF8EncodingTranslator implements EncodingTranslator {
        static EncodingTranslator function getInstance(){...}
        UString function encode(string $s){ return UString::fromUTF8($s); }
        string function decode(UString $u){ return $u->toUTF8(); }
    }

    # Other encoder/decoder function pairs:
    class ISO88591EncodingTranslator implements EncodingTranslator { ... }
    class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
    ...
    class ASCIIEncodingTranslator implements EncodingTranslator { ... }

    # Here the PHP engine creates the per-source-file translator object,
    # setting the $curr_encoding_translator variable (pseudocode: the
    # conditions stand for checks made inside the engine):
    if( the engine detected this src is UTF8 encoded ){
        $curr_encoding_translator = UTF8EncodingTranslator::getInstance();
    } else if( the engine detected this src is ISO-8859-1 encoded ){
        $curr_encoding_translator = ISO88591EncodingTranslator::getInstance();
    ...
    } else {
        $curr_encoding_translator = ASCIIEncodingTranslator::getInstance();
    }

The u() function may then use the global variable $curr_encoding_translator
to encode and decode every string in every specific source program; that
is, $curr_encoding_translator changes its value according to the source
file currently being executed. In this way libraries can be developed
separately, with different source encodings, without affecting their
interoperability with past and future programs.


Further developments
====================

- I/O functions that support Unicode file names through modern dedicated
  classes/functions: FileOutputStream, FileInputStream, etc.
  This is particularly needed under Windows, whose file system uses the
  UCS-2 encoding (in brief: replace fopen() with _wfopen() etc. under
  Windows, or provide another "hook" that exposes these functions to PHP
  sources so that these classes can be implemented in PHP code).

- String pattern matching, aka regex, but fully Unicode aware.

- An encoding-independent database abstraction layer.

- A new generation of portable libraries.


Proof of the concept - The actual PHP implementation
====================================================

The PHP implementation of all this is available both as documentation and
as PHP source at this address:

    http://www.icosaedro.it/phplint/libraries.cgi

Almost all the classes listed above are available, in particular:

    UString
    UPattern  (regex with UString)
    utf8.php  (provides u() and uecho() for UTF-8 only)
    FileName  (attempt to support Unicode file names on Linux and Win)

Regards,
 ___
/_|_\  Umberto Salsi
\/_\/  www.icosaedro.it

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php