Hi,

I don't think many people are subscribed to this list, so I'm CCing the
internals list, since your suggestion is to implement this in (or bundle
it with) the core.
On Wed, Mar 21, 2012 at 7:23 PM, Umberto Salsi <sa...@icosaedro.it> wrote:

> Although I have never contributed to the code of the PHP project, I hope
> the ideas that follow may provide some suggestions for the future
> development of the PHP language. The basic idea is described in the
> abstract, and those who are not interested may stop there :-)
>
>
> Abstract
> ========
>
> My modest proposal is to provide Unicode support to PHP without the need
> to rewrite the whole engine and its libraries, by introducing the
> UString abstraction layer as a regular class, with minimal support from
> the core engine. Basically, the UString class hides the internal
> implementation of Unicode strings and allows experimenting with several
> solutions (UTF-8, UCS-2, UCS-4, ...). Several different implementations
> may be attempted and may also be made available, leaving the user free
> to choose the best compromise between performance and memory footprint.
> Two useful functions, u() and uecho(), are also discussed; they may help
> in writing Unicode-interoperable source programs and libraries, and they
> require some support from the PHP engine.
>
>
> The UString class
> =================
>
> 1. The UString class holds an immutable array of Unicode characters. It
> is final and hides the internal representation of the Unicode string,
> which may be UTF-8, UCS-2, or anything else:
>
>     final class UString
>         implements Hashable, Comparable, Sortable, UPrintable,
>         Printable, Serializable
>     { ... }
>
> (Besides the well-known Serializable interface, the other implemented
> interfaces are discussed below.)
>
> All the PHP programs and external libraries based on this class are
> completely unaware of the actual internal encoding used. Several
> implementations may be provided to fit different needs.
> In Western European countries UTF-8 works well, and some optimizations
> allow performance very close to that of an ordinary "string" of bytes.
> For example, in my PHP implementation an internal byte index allows
> scanning the UTF-8 sequences quickly, both forward and backward.
>
> 2. The UString class has no constructor, but several factory methods
> that take arrays of bytes (aka strings):
>
>     static UString function fromASCII(string $s)
>     static UString function fromUTF8(string $s)
>     static UString function fromISO88591(string $s)
>     static UString function fromUTF16LE(string $s)
>     static UString function fromUTF16BE(string $s)
>     ...
>
> These factory methods silently skip invalid bytes and invalid sequences,
> possibly replacing them with '?'. No warnings, no exceptions. Other
> utility functions may be provided that check an array of bytes for a
> specific encoding.
>
> Several corresponding instance methods perform the reverse translation
> into an array of bytes:
>
>     string function toASCII()
>     string function toUTF8()
>     string function toISO88591()
>     string function toUTF16LE()
>     string function toUTF16BE()
>     ...
>
> So, for example,
>
>     UString::fromUTF8( $u->toUTF8() )->equals($u)
>
> is always TRUE for any Unicode string $u (the equals() method is
> described below).
>
> 3. The UString class provides the usual string manipulation routines:
>
>     int function length()
>     UString function substring($from, $to)
>     UString function charAt($index)
>     UString function append(UString $u)
>     bool function startsWith(UString $u)
>     bool function endsWith(UString $u)
>     int function indexOf(UString $u)
>     UString function trim($blacklist = u("\n\r\t"))
>     bool function equalsIgnoreCase(UString $u)
>     UString function toUpperCase()
>     UString function toLowerCase()
>     UString[] function explode($separator = u(" "))
>     UString function implode($separator = u(""))
>     ...
>
> (More about the magic u() function later.)
>
> 4.
> The UString class implements the UPrintable interface, which returns
> "the best human-readable representation of the object as a UString
> string", that is, the string itself:
>
>     UString function __toUString(){ return $this; }
>
> 5. The UString class implements the Printable interface, which returns
> "the best human-readable representation of the object as a string,
> possibly composed of ASCII characters only":
>
>     string function __toString(){ return $this->toASCII(); }
>
> 6. The UString class implements the Hashable interface, useful for
> implementing hashing algorithms (HashMap, HashSet, ...):
>
>     int function getHash(){ ... }
>
> Since UString is immutable, this function may compute the hash once and
> for all. In my current PHP implementation I have used crc32(), but the
> PHP engine hides a more efficient hashing function that might be used
> instead (what about making it available to userland code?).
>
> 7. The UString class implements the Comparable interface:
>
>     bool function equals(object $u)
>
> which returns TRUE only if the object is a UString and contains the same
> sequence of Unicode characters, and returns FALSE in any other case.
>
> 8. The UString class implements the Sortable interface:
>
>     int function compareTo(object $u)
>
> which returns -1, 0 or +1 if $u is a UString, or raises E_WARNING if $u
> is not a UString.
>
>
>
> The u() and the uecho() magic functions
> =======================================
>
> Two "magic" functions help in writing PHP programs. Basically, u() is a
> factory function that translates an array of bytes into a UString, for
> example:
>
>     UString function u(string $s){
>         return UString::fromXxx($s);
>     }
>
>     $hello = u("hello");
>
> where Xxx is the encoding of the source.
> But the u() function may do much more than this, and the implementation
> I have made provides several other features:
>
> - Literal strings are cached, so if the u("hello") statement is executed
>   several times, a single UString object is created once and for all,
>   and this object is returned each time u("hello") is evaluated. Since
>   the number of literal strings in a source program is finite, the
>   string cache is finite as well.
>
> - It automatically converts any type of data into a UString, so that
>   u(123) yields the same as UString::fromASCII( (string) 123 ). Here
>   too, small numbers, the most common ones, can be cached.
>
> - If the argument is an object that implements the UPrintable interface,
>   its __toUString() method is called; if it implements Printable, the
>   __toString() method is called. Boolean values generate "FALSE" or
>   "TRUE". The NULL value generates "NULL". Etc.
>
> - If the argument is a UString, u() returns it unchanged.
>
> - If several arguments are provided, each argument is converted into a
>   UString and they are concatenated with UString::append().
>
> All this can be implemented in PHP source, and does not require changes
> to the engine. The uecho() function does just the same, but also sends
> the result to stdout using the chosen encoding:
>
>     uecho(...)  ====>  echo u(...)->toXxx();
>
> where Xxx is the encoding corresponding to the one used in the u()
> function to translate literal strings.
>
> Programmers may then write something like this:
>
>     function TenHelloWorld() {
>         $hello = u("hello");
>         $world = u("world!");
>         for($i = 0; $i < 10; $i++)
>             uecho($i, " - ", $hello, ", ", $world);
>     }
>
> This function generates 2 objects for the $hello and $world vars
> (cached), 2 objects for the " - " and the ", " literal strings (cached),
> 10 objects for the $i numbers (possibly cached), and another 10 objects
> for the resulting concatenation of the strings.
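To make the list of u() features above concrete, here is a minimal sketch in plain PHP, assuming UTF-8 source files. MiniUString is a hypothetical stand-in for the proposed UString class (it only wraps bytes and is not the code from the proposal), and the cache is keyed on the string value, so it cannot tell literals from dynamically built strings. Objects implementing UPrintable are left out for brevity.

```php
<?php
// Hypothetical stand-in for the proposed UString class: it only wraps
// UTF-8 bytes and supports concatenation.
final class MiniUString
{
    private $utf8;

    private function __construct(string $utf8) { $this->utf8 = $utf8; }

    public static function fromUTF8(string $s): MiniUString
    {
        return new MiniUString($s);   // assumes $s is already valid UTF-8
    }

    public function append(MiniUString $u): MiniUString
    {
        return new MiniUString($this->utf8 . $u->utf8);
    }

    public function toUTF8(): string { return $this->utf8; }
}

// Sketch of u(): converts every argument and concatenates the results.
// String and integer arguments are cached by value.
function u(...$args): MiniUString
{
    static $cache = array();
    $result = null;
    foreach ($args as $arg) {
        if ($arg instanceof MiniUString) {
            $piece = $arg;                          // returned unchanged
        } elseif (is_string($arg) || is_int($arg)) {
            $key = (string) $arg;
            if (!isset($cache[$key])) {
                $cache[$key] = MiniUString::fromUTF8($key);
            }
            $piece = $cache[$key];
        } elseif (is_bool($arg)) {
            $piece = u($arg ? 'TRUE' : 'FALSE');
        } elseif ($arg === null) {
            $piece = u('NULL');
        } else {
            // Floats and objects with __toString(); not cached.
            $piece = MiniUString::fromUTF8((string) $arg);
        }
        $result = ($result === null) ? $piece : $result->append($piece);
    }
    return ($result === null) ? u('') : $result;
}

// Sketch of uecho(): same conversion, then output in the source encoding.
function uecho(...$args): void
{
    echo u(...$args)->toUTF8();
}
```

With this sketch, two calls to u("hello") return the very same object, while uecho($i, " - ", u("hello")) allocates one new object per append.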
> If this function is called again, the cached values are reused.
>
>
> Support from the PHP engine
> ===========================
>
> R1. First of all, the current implementation of the UString class, being
> bare PHP code, isn't very efficient, and a C implementation of all or at
> least some of the most critical sections of code could greatly improve
> performance. Since the PHP code developed around the UString class does
> not depend on the internal representation of the Unicode characters,
> several implementations may be tested and the final decision about the
> "standard" one can be postponed. Or the choice can be left to the users,
> who may choose the implementation that best fits their needs.
>
> R2. Second, the u() function requires some support from the PHP engine
> because it cannot be used in static expressions. The following code, for
> example, is not valid:
>
>     function f( UString $s = u("xxx") ){ ... }
>
>     class MyClass {
>         const
>             C = u("xxx"),
>             A = array( u("zero"), u("one"), u("two") );
>     }
>
> If the restriction PHP imposes on static expressions could be relaxed a
> bit, at least for some magic functions, the code above would be
> possible.
>
> R3. Third, only the engine can establish whether a string that enters
> the u() function is really a literal string and not a dynamically
> generated string. For example, in my current PHP implementation I can
> only warn the programmer in the documentation against doing things like
>
>     for($i = 0; $i < 10000; $i++)
>         uecho( "cycle no. $i" );
>
> that would pollute the cache of u() with thousands of useless
> dynamically generated strings. Or, even better, the PHP engine itself
> might split the string and rewrite it automatically in a cache-aware way
> as:
>
>     uecho("cycle no. ", $i);
>
> R4.
> Another area where some support from the PHP engine would be useful is
> the detection of the encoding used in the source, so that the Xxx
> encoding to be used in the u() and uecho() functions can be determined
> automatically. In my current PHP implementation I stick with UTF-8, but
> a more general approach may take advantage of the new
> declare(encoding="Xxx") statement. For example, the engine might
> instantiate a "translator" object to be used for the current source, and
> this translator object might be made available to the program as a
> global variable that translates from an array of bytes to a UString and
> vice versa:
>
>     interface EncodingTranslator {
>         # Encoder call-back:
>         UString function encode(string $s);
>         # Decoder call-back:
>         string function decode(UString $u);
>     }
>
>     # UTF-8 specific encoder/decoder function pair:
>     class UTF8EncodingTranslator implements EncodingTranslator {
>         static EncodingTranslator function getInstance(){...}
>         UString function encode(string $s){ return UString::fromUTF8($s); }
>         string function decode(UString $u){ return $u->toUTF8(); }
>     }
>
>     # Other encoder/decoder function pairs:
>     class ISO88591EncodingTranslator implements EncodingTranslator { ... }
>     class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
>     ...
>     class ASCIIEncodingTranslator implements EncodingTranslator { ... }
>
>
>     # Here the PHP engine creates the per-source-file translator object,
>     # setting the $curr_encoding_translator variable:
>     if( the engine detected this src is UTF8 encoded ){
>         $curr_encoding_translator =
>             UTF8EncodingTranslator::getInstance();
>     } else if( the engine detected this src is ISO-8859-1 encoded ){
>         $curr_encoding_translator =
>             ISO88591EncodingTranslator::getInstance();
>     ...
>     } else {
>         $curr_encoding_translator =
>             ASCIIEncodingTranslator::getInstance();
>     }
>
>
> The u() function may then use the global variable
> $curr_encoding_translator to encode and decode every string in every
> specific source program; that is, $curr_encoding_translator changes its
> value according to the source which is currently under execution. In
> this way libraries can be developed separately with different source
> encodings without affecting interoperability with past and future
> programs.
>
>
> Further developments
> ====================
>
> - I/O functions that support Unicode file names through modern dedicated
>   classes/functions: FileOutputStream, FileInputStream, etc. This is
>   particularly required under Windows, whose file system uses the UCS-2
>   encoding (in brief: replace fopen() with _wfopen() etc. under Windows,
>   or provide another "hook" that exposes these functions to PHP sources
>   so that these classes can be implemented in PHP code).
> - String pattern matching, aka regex, but fully Unicode aware.
> - An encoding-independent database abstraction layer.
> - A new generation of portable libraries.
>
>
> Proof of concept - The actual PHP implementation
> ================================================
>
> The PHP implementation of all this is available both as documentation
> and as PHP source at this address:
>
>     http://www.icosaedro.it/phplint/libraries.cgi
>
> Almost all the classes listed above are available, in particular:
>
>     UString
>     UPattern  (regex with UString)
>     utf8.php  (provides u() and uecho() for UTF-8 only)
>     FileName  (attempt to support Unicode file names on Linux and Win)
>
>
>
> Regards,
>  ___
> /_|_\  Umberto Salsi
> \/_\/  www.icosaedro.it
>
>
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu