[PHP-I18N] Unicode support with UString abstraction layer

Umberto Salsi Wed, 21 Mar 2012 11:26:21 -0700

Although I have never contributed code to the PHP project, I hope the ideas
that follow may offer some suggestions for the future development of the PHP
language. The basic idea is described in the abstract, and those who are not
interested may stop there :-)



Abstract
========

This is my modest proposal to provide Unicode support in PHP without rewriting
the whole engine and its libraries: introduce the UString abstraction layer as
a regular class, with minimal support from the core engine. Basically, the
UString class hides the internal implementation of Unicode strings and makes it
possible to experiment with several solutions (UTF-8, UCS-2, UCS-4, ...).
Several different implementations may be attempted and may also be made
available, leaving the user free to choose the best compromise between
performance and memory footprint. Two useful functions, u() and uecho(), are
also discussed; they may help in writing Unicode-interoperable source programs
and libraries, and they require some support from the PHP engine.


The UString class
=================

1. The UString class holds an immutable array of Unicode characters. It is
final and hides the internal representation of the Unicode string, which may be
UTF-8 or UCS-2 or anything else:

final class UString
        implements Hashable, Comparable, Sortable, UPrintable,
                Printable, Serializable
{ ... }

(Besides the well known Serializable interface, the other implemented
interfaces will be discussed later.)

All the PHP programs and the external libraries based on this class are
completely unaware of the actual internal encoding used. Several
implementations may be provided to fit different needs. In Western European
countries UTF-8 works well, and some optimizations allow performance that is
very close to that of an ordinary "string" of bytes. For example, in my PHP
implementation an internal byte index allows scanning the UTF-8 sequences
quickly forward and backward.
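
The forward and backward scan relies on the fact that every UTF-8 continuation
byte has the form 10xxxxxx. What follows is only a minimal sketch of that idea,
not the code of the actual implementation: the function names utf8_next() and
utf8_prev() are hypothetical, and the input is assumed to be well-formed UTF-8.

```php
<?php
// Hypothetical helpers: move a byte index to the start of the next or of the
// previous UTF-8 sequence. Assumes $s is well-formed UTF-8 (PHP 7.0+).

function utf8_next(string $s, int $pos): int
{
    $n = strlen($s);
    if ($pos >= $n) {
        return $n;
    }
    $pos++;
    // Skip continuation bytes (bit pattern 10xxxxxx).
    while ($pos < $n && (ord($s[$pos]) & 0xC0) === 0x80) {
        $pos++;
    }
    return $pos;
}

function utf8_prev(string $s, int $pos): int
{
    if ($pos <= 0) {
        return 0;
    }
    $pos--;
    // Walk back over continuation bytes until a lead byte is found.
    while ($pos > 0 && (ord($s[$pos]) & 0xC0) === 0x80) {
        $pos--;
    }
    return $pos;
}

// 'a' (1 byte), U+00E8 (2 bytes), U+20AC (3 bytes), 'z' (1 byte)
$s = "a\u{E8}\u{20AC}z";
$second = utf8_next($s, 0);        // byte offset 1
$third  = utf8_next($s, $second);  // byte offset 3
$fourth = utf8_next($s, $third);   // byte offset 6
```

A cached (character index, byte index) pair plus these two steps is enough to
reach a nearby character without rescanning the string from the start.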

2. The UString class has no constructor, but several factory methods that take
arrays of bytes (aka string):

        static UString function fromASCII(string $s)
        static UString function fromUTF8(string $s)
        static UString function fromISO88591(string $s)
        static UString function fromUTF16LE(string $s)
        static UString function fromUTF16BE(string $s)
        ...

These factory methods silently skip invalid bytes and invalid sequences,
possibly replacing them with '?'. No warning, no exceptions. Other utility
functions may be provided that check an array of bytes for a specific encoding.

Several corresponding instance methods perform the reverse translation into an
array of bytes:

        string function toASCII()
        string function toUTF8()
        string function toISO88591()
        string function toUTF16LE()
        string function toUTF16BE()
        ...

So, for example,

        UString::fromUTF8( $u->toUTF8() )->equals($u)

is always TRUE for any Unicode string $u (the equals() method is described
below).
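
A rough sketch of the "silently replace invalid bytes" policy and of the
round-trip property follows. It is an illustration under stated assumptions,
not the real class: UStringSketch is a hypothetical stand-in that keeps UTF-8
internally, replaces each offending byte with '?', and needs PHP 7.4 or later.

```php
<?php
// Hypothetical stand-in for UString (PHP 7.4+). Internal storage: UTF-8.
final class UStringSketch
{
    private string $utf8;

    // No public constructor: instances come from the factory methods only.
    private function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // Keep well-formed UTF-8 sequences, turn any other byte into '?'.
    public static function fromUTF8(string $s): self
    {
        $valid = '[\x00-\x7F]'
            . '|[\xC2-\xDF][\x80-\xBF]'
            . '|\xE0[\xA0-\xBF][\x80-\xBF]'
            . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
            . '|\xED[\x80-\x9F][\x80-\xBF]'
            . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
            . '|[\xF1-\xF3][\x80-\xBF]{3}'
            . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';
        $clean = preg_replace_callback(
            '/(?:' . $valid . ')|(.)/s',
            function (array $m): string {
                // Group 1 is set only when a stray byte was matched.
                return (isset($m[1]) && $m[1] !== '') ? '?' : $m[0];
            },
            $s
        );
        return new self((string) $clean);
    }

    // Bytes outside the ASCII range become '?'.
    public static function fromASCII(string $s): self
    {
        return new self((string) preg_replace('/[\x80-\xFF]/', '?', $s));
    }

    public function toUTF8(): string
    {
        return $this->utf8;
    }

    public function equals($other): bool
    {
        return $other instanceof self && $other->utf8 === $this->utf8;
    }
}

// "caf" + U+00E9 + space + a stray 0xFF byte + "!"
$u = UStringSketch::fromUTF8("caf\xC3\xA9 \xFF!");
$roundTrip = UStringSketch::fromUTF8($u->toUTF8())->equals($u);
```

Because the factory already sanitizes its input, toUTF8() can only yield
well-formed UTF-8, which is why the round trip always compares equal.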

3. The UString class provides the usual string manipulation routines:

        int function length()
        UString function substring($from, $to)
        UString function charAt($index)
        UString function append(UString $u)
        bool    function startsWith(UString $u)
        bool    function endsWith(UString $u)
        int     function indexOf(UString $u)
        UString function trim($blacklist = u("\n\r\t"))
        bool    function equalsIgnoreCase(UString $u)
        UString function toUpperCase()
        UString function toLowerCase()
        UString[] function explode($separator = u(" "))
        UString function implode($separator = u(""))
        ...

(More about the magic u() function later).
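
To show why these methods must count characters rather than bytes, here is a
small sketch built on preg_split() with the /u modifier. The helper name
codePoints() is hypothetical, and the Java-like meaning assumed for
substring($from, $to) (end index excluded) is only an assumption.

```php
<?php
// Hypothetical helper: split a well-formed UTF-8 string into an array of
// one-character strings. preg_split() returns false on malformed input.
function codePoints(string $utf8): array
{
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('not well-formed UTF-8');
    }
    return $chars;
}

$s = "na\u{EF}ve";                 // "naive" with U+00EF: 5 characters, 6 bytes
$chars = codePoints($s);

$byteLength = strlen($s);          // what a plain byte string reports
$length     = count($chars);       // what length() is expected to report
$charAt2    = $chars[2];           // what charAt(2) is expected to return
// substring(1, 4), assuming the end index is excluded:
$sub = implode('', array_slice($chars, 1, 3));
```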

4. The UString class implements the UPrintable interface, which returns "the
best human-readable representation of the object as a UString string", that is,
the string itself:

        UString function __toUString(){ return $this; }

5. The UString class implements the Printable interface that returns "the best
human-readable representation of the object as a string, possibly composed of
ASCII characters only":

        string function __toString(){ return $this->toASCII(); }

6. The UString class implements the Hashable interface, useful for implementing
hash-based containers (HashMap, HashSet, ...):

        int function getHash(){ ... }

Since UString is immutable, this function may compute the hash once and for
all. In my current PHP implementation I have used crc32(), but the PHP engine
contains a more efficient internal hashing function that might be used instead
(what about making it available to userland code?).

7. The UString class implements the Comparable interface:

        bool function equals(object $u)

that returns TRUE only if the object is a UString and contains the same
sequence of Unicode characters, and returns FALSE in any other case.

8. The UString class implements the Sortable interface:

        int function compareTo(object $u)

that returns -1, 0 or +1 if $u is a UString, or raises E_WARNING if $u is not a
UString.
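
Points 6 to 8 can be sketched together as follows. This is an illustration
only: the class name Text is hypothetical, the comparison is a plain code point
ordering (byte-wise comparison of well-formed UTF-8 gives exactly that, not a
locale-aware collation), userland code can only raise E_USER_WARNING, and the
value returned after the warning is an assumption.

```php
<?php
// Hypothetical stand-in for UString (PHP 7.4+), UTF-8 internally.
final class Text
{
    private string $utf8;
    private ?int $hash = null;      // computed lazily, then reused

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // The object is immutable, so the hash never needs to be recomputed.
    public function getHash(): int
    {
        if ($this->hash === null) {
            $this->hash = crc32($this->utf8);
        }
        return $this->hash;
    }

    // TRUE only for another Text holding the same characters.
    public function equals($other): bool
    {
        return $other instanceof self && $other->utf8 === $this->utf8;
    }

    // -1, 0 or +1 by code point order; warns on a foreign argument.
    public function compareTo($other): int
    {
        if (!($other instanceof self)) {
            trigger_error('compareTo(): argument is not a Text',
                E_USER_WARNING);
            return 0;               // assumption: the text does not say
        }
        return strcmp($this->utf8, $other->utf8) <=> 0;
    }
}

$a = new Text("a");
$b = new Text("b");
$order = $a->compareTo($b);         // "a" sorts before "b"
```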



The u() and the uecho() magic functions
=======================================

Two "magic" functions help in writing PHP programs. Basically, u() is a
factory function that translates an array of bytes into a UString, for example:

        UString function u(string $s){
                return UString::fromXxx($s);
        }
        
        $hello = u("hello");

where Xxx is the encoding of the source. But the u() function may do much more
than this, and the implementation I have written provides several other
features:

- Literal strings are cached, so if the u("hello") statement is executed
several times, a single UString object is created once and for all, and this
object is returned each time u("hello") is evaluated. Since the number of
literal strings in a source program is finite, the string cache is finite as
well.

- Any type of data is automatically converted into UString, so that u(123)
yields the same as UString::fromASCII( (string) 123 ). Here too, small numbers,
the most common ones, can be cached.

- If the argument is an object that implements the UPrintable interface, its
__toUString() method is called; if it implements Printable, the __toString()
method is called. Boolean values generate "FALSE" or "TRUE", the NULL value
generates "NULL", and so on.

- If the argument is a UString, it is returned as is.

- If several arguments are provided, each argument is converted into UString
and the results are concatenated with UString::append().

All this can be implemented in PHP source, and does not require changes to the
engine. The uecho() function does just the same, but also sends the result to
stdout using the chosen encoding:

        uecho(...)  ====>   echo u(...)->toXxx();

where Xxx is the same encoding that the u() function uses to translate literal
strings.
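
The behaviour listed above can be approximated in a few lines of plain PHP.
The sketch below is not the utf8.php code: UStr is a hypothetical, minimal
stand-in for UString, UTF-8 is assumed as the source encoding, only string
arguments are cached, and the fallback for other types is an assumption.

```php
<?php
// Hypothetical minimal stand-in for UString (PHP 7.4+), UTF-8 internally.
final class UStr
{
    private string $utf8;
    private function __construct(string $s) { $this->utf8 = $s; }
    // No validation here, to keep the sketch short.
    public static function fromUTF8(string $s): self { return new self($s); }
    public function toUTF8(): string { return $this->utf8; }
    public function append(self $u): self
    {
        return new self($this->utf8 . $u->utf8);
    }
}

function u(...$args): UStr
{
    static $cache = [];             // one object per distinct string
    $result = null;
    foreach ($args as $a) {
        if ($a instanceof UStr) {
            $piece = $a;                                 // returned as is
        } elseif (is_string($a)) {
            if (!isset($cache[$a])) {
                $cache[$a] = UStr::fromUTF8($a);
            }
            $piece = $cache[$a];
        } elseif (is_int($a) || is_float($a)) {
            $piece = UStr::fromUTF8((string) $a);        // not cached here
        } elseif (is_bool($a)) {
            $piece = u($a ? 'TRUE' : 'FALSE');
        } elseif ($a === null) {
            $piece = u('NULL');
        } elseif (is_object($a) && method_exists($a, '__toUString')) {
            $piece = $a->__toUString();                  // UPrintable
        } elseif (is_object($a) && method_exists($a, '__toString')) {
            $piece = UStr::fromUTF8((string) $a);        // Printable
        } else {
            $piece = UStr::fromUTF8(var_export($a, true)); // assumption
        }
        $result = ($result === null) ? $piece : $result->append($piece);
    }
    return $result ?? u('');
}

function uecho(...$args): void
{
    echo u(...$args)->toUTF8();     // same encoding as the one used by u()
}
```

With these definitions, u("hello") === u("hello") holds because both calls
return the same cached object, while each concatenation yields a new object.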

Programmers may then write something like this:

        function TenHelloWorld() {
                $hello = u("hello");
                $world = u("world!");
                for($i = 0; $i < 10; $i++)
                        uecho($i, " - ", $hello, ", ", $world);
        }

This function creates 2 objects for the $hello and $world variables (cached), 2
objects for the " - " and ", " literal strings (cached), 10 objects for the $i
numbers (possibly cached), and another 10 objects for the resulting
concatenations. If this function is called again, the cached values are
reused.


Support from the PHP engine
===========================

R1. First of all, the current implementation of the UString class, being plain
PHP code, isn't very efficient, and a C implementation of all, or at least of
the most critical, sections of code could greatly improve performance. Since
the PHP code developed around the UString class does not depend on the internal
representation of the Unicode characters, several implementations may be tested
and the final decision about the "standard" one can be postponed. Alternatively,
the choice can be left to the users, who may choose the implementation that
best fits their needs.

R2. Second, the u() function requires some support from the PHP engine because
it cannot be used in static expressions. The following code, for example, is
not valid:

function f( UString $s = u("xxx") ){ ... }

class MyClass {
        const
                C = u("xxx"),
                A = array( u("zero"), u("one"), u("two") );
}

If the restrictions PHP imposes on static expressions could be relaxed a bit,
at least for some magic functions, the code above would become possible.
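
Until then, a userland workaround is possible, at the cost of some verbosity:
resolve the default at call time and expose "constants" as static methods. The
sketch below assumes PHP 7.4 or later; UStr and this one-argument u() are
hypothetical stand-ins, and the method names C() and A() simply mirror the
constants of the example above.

```php
<?php
// Hypothetical minimal stand-ins (PHP 7.4+).
final class UStr
{
    private string $utf8;
    public function __construct(string $s) { $this->utf8 = $s; }
    public function toUTF8(): string { return $this->utf8; }
}

function u(string $s): UStr
{
    static $cache = [];
    return $cache[$s] ??= new UStr($s);   // one object per literal
}

// Instead of: function f( UString $s = u("xxx") )
function f(?UStr $s = null): UStr
{
    $s ??= u("xxx");                      // default resolved at call time
    return $s;
}

// Instead of class constants initialized with u(...)
final class MyClass
{
    public static function C(): UStr
    {
        return u("xxx");
    }

    public static function A(): array
    {
        return [u("zero"), u("one"), u("two")];
    }
}

$default = f();                           // same cached object as MyClass::C()
```

The drawback is that a NULL argument can no longer be told apart from a
missing one, which is one reason why engine support would be preferable.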

R3. Third, only the engine can establish whether a string that enters the u()
function is really a literal string and not a dynamically generated one. For
example, in my current PHP implementation I can only warn programmers in the
documentation against doing things like

        for($i = 0; $i < 10000; $i++)
                uecho( "cycle no. $i" );

that would pollute the cache of u() with thousands of useless dynamically
generated strings. Or, even better, the PHP engine itself might split the
string and rewrite it automatically in a cache-aware way as:

        uecho("cycle no. ", $i);
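
The effect on the cache is easy to measure. In the sketch below, LiteralCache
and uechoSketch() are hypothetical names; the cache stores plain strings where
a real one would hold UString objects, and uechoSketch() returns its result
instead of printing it so that the loops stay silent.

```php
<?php
// Hypothetical cache that counts its entries (PHP 7.4+).
final class LiteralCache
{
    private static array $entries = [];

    public static function get(string $literal): string
    {
        // A real cache would store a UString object here.
        return self::$entries[$literal] ??= $literal;
    }

    public static function size(): int
    {
        return count(self::$entries);
    }

    public static function clear(): void
    {
        self::$entries = [];
    }
}

// Caches string arguments only; numbers are converted on the fly.
function uechoSketch(...$args): string
{
    $out = '';
    foreach ($args as $a) {
        $out .= is_string($a) ? LiteralCache::get($a) : (string) $a;
    }
    return $out;
}

// Interpolated string: every iteration adds a new cache entry.
for ($i = 0; $i < 1000; $i++) {
    uechoSketch("cycle no. $i");
}
$polluted = LiteralCache::size();

// Literal and number passed separately: a single entry is reused.
LiteralCache::clear();
for ($i = 0; $i < 1000; $i++) {
    uechoSketch("cycle no. ", $i);
}
$clean = LiteralCache::size();
```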

R4. Another area where some support from the PHP engine would be useful is the
detection of the encoding used in the source, so that the Xxx encoding to be
used in the u() and uecho() functions can be determined automatically. In my
current PHP implementation I stick with UTF-8, but a more general approach may
take advantage of the new declare(encoding="Xxx") statement. For example, the
engine might instantiate a "translator" object to be used for the current
source, and this translator object might be made available to the program as a
global variable that translates from array of bytes to UString and vice versa:

interface EncodingTranslator {
        # Encoder call-back:
        UString function encode(string $s);
        # Decoder call-back:
        string function decode(UString $u);
}

# UTF-8 specific encoder/decoder functions pair:
class UTF8EncodingTranslator implements EncodingTranslator {
        static EncodingTranslator function getInstance(){...}
        UString function encode(string $s){ return UString::fromUTF8($s); }
        string function decode(UString $u){ return $u->toUTF8(); }
}

# Other encoder/decoder functions pairs:
class ISO88591EncodingTranslator implements EncodingTranslator { ... }
class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
...
class ASCIIEncodingTranslator implements EncodingTranslator { ... }


# Here the PHP engine creates the per-source file translator object,
# setting the $curr_encoding_translator variable:
if( the engine detected this src is UTF8 encoded ){
        $curr_encoding_translator = UTF8EncodingTranslator::getInstance();
} else if( the engine detected this src is ISO-8859-1 encoded ){
        $curr_encoding_translator = ISO88591EncodingTranslator::getInstance();
...
} else {
        $curr_encoding_translator = ASCIIEncodingTranslator::getInstance();
}


The u() function may then use the global variable $curr_encoding_translator to
encode and decode every string in every specific source program; that is,
$curr_encoding_translator changes its value according to the source file
currently being executed. In this way libraries can be developed separately
with different source encodings without affecting interoperability with past
and future programs.
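
In plain PHP, and without engine support, the translator idea might look like
the sketch below. Everything in it is illustrative: UStr is a hypothetical
stand-in that keeps UTF-8 internally, the class names are shortened, and
translatorFor() is a made-up helper that a program would have to call by hand,
since the engine does not select the translator per source file.

```php
<?php
// Hypothetical minimal stand-in for UString (PHP 7.4+), UTF-8 internally.
final class UStr
{
    private string $utf8;
    public function __construct(string $utf8) { $this->utf8 = $utf8; }
    public function toUTF8(): string { return $this->utf8; }
}

interface EncodingTranslator
{
    public function encode(string $bytes): UStr;   // bytes -> UStr
    public function decode(UStr $u): string;       // UStr -> bytes
}

final class UTF8Translator implements EncodingTranslator
{
    public function encode(string $bytes): UStr { return new UStr($bytes); }
    public function decode(UStr $u): string { return $u->toUTF8(); }
}

final class Latin1Translator implements EncodingTranslator
{
    // Each ISO-8859-1 byte >= 0x80 becomes a two-byte UTF-8 sequence.
    public function encode(string $bytes): UStr
    {
        $out = '';
        for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
            $b = ord($bytes[$i]);
            $out .= ($b < 0x80)
                ? $bytes[$i]
                : (chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)));
        }
        return new UStr($out);
    }

    // Characters above U+00FF have no ISO-8859-1 byte and become '?'.
    public function decode(UStr $u): string
    {
        return (string) preg_replace_callback(
            '/([\xC2\xC3])([\x80-\xBF])|[\xC4-\xF4][\x80-\xBF]+/',
            function (array $m): string {
                if (!empty($m[1])) {
                    return chr(((ord($m[1]) & 0x03) << 6)
                        | (ord($m[2]) & 0x3F));
                }
                return '?';
            },
            $u->toUTF8()
        );
    }
}

// Made-up selector; defaulting to UTF-8 is an assumption of this sketch.
function translatorFor(string $sourceEncoding): EncodingTranslator
{
    switch (strtoupper($sourceEncoding)) {
        case 'ISO-8859-1':
            return new Latin1Translator();
        default:
            return new UTF8Translator();
    }
}

$latin1 = translatorFor('iso-8859-1');
$u = $latin1->encode("caf\xE9");          // "cafe" with U+00E9, Latin-1 bytes
```

A library written in ISO-8859-1 and a program written in UTF-8 would then
exchange UStr objects only, each using its own translator at the boundaries.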


Further developments
====================

- I/O functions that support Unicode file names through modern dedicated
classes/functions: FileOutputStream, FileInputStream, etc. This is particularly
needed under Windows, whose file system uses the UCS-2 encoding (in brief:
replace fopen() with _wfopen() etc. under Windows, or provide another "hook"
that exposes these functions to PHP sources so that these classes can be
implemented in PHP code).
- String pattern matching, aka regex, but fully Unicode aware.
- An encoding-independent database abstraction layer.
- A new generation of portable libraries.


Proof of concept - The actual PHP implementation
================================================

The PHP implementation of all this is available both as documentation and as
PHP source at this address:

http://www.icosaedro.it/phplint/libraries.cgi

Almost all the classes listed above are available, in particular:

        UString
        UPattern (regex with UString)
        utf8.php (provides u() and uecho() for UTF-8 only)
        FileName (attempt to support Unicode file names on Linux and Win)



Regards,
 ___
/_|_\  Umberto Salsi
\/_\/  www.icosaedro.it


-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
