[PHP-DEV] Re: [PHP-I18N] Unicode support with UString abstraction layer

Ferenc Kovacs Fri, 31 Aug 2012 05:21:42 -0700

On Sun, Jul 22, 2012 at 11:50 PM, Ferenc Kovacs <[email protected]> wrote:


>
> On Wed, Mar 21, 2012 at 7:23 PM, Umberto Salsi <[email protected]> wrote:
>
>> Although I have never contributed to the code of the PHP project, I hope
>> the ideas that follow may provide some suggestions for the future
>> development of the PHP language. The basic idea is described in the
>> abstract, and those who are not interested may stop there :-)
>>
>>
>> Abstract
>> ========
>>
>> My modest proposal is to provide Unicode support to PHP without the need
>> to rewrite the whole engine and its libraries, by introducing the UString
>> abstraction layer as a regular class, with minimal support from the core
>> engine. Basically, the UString class hides the internal implementation of
>> Unicode strings and makes it possible to experiment with several solutions
>> (UTF-8, UCS-2, UCS-4, ...). Several different implementations may be
>> attempted and may also be made available, leaving the user free to choose
>> the best compromise between performance and memory footprint. Two useful
>> functions, u() and uecho(), are also discussed; they may help in writing
>> Unicode-interoperable source programs and libraries, and they require
>> some support from the PHP engine.
>>
>>
>> The UString class
>> =================
>>
>> 1. The UString class holds an immutable array of Unicode characters. It is
>> final and hides the internal representation of the Unicode string, which
>> may be UTF-8, UCS-2 or anything else:
>>
>> final class UString
>>         implements Hashable, Comparable, Sortable, UPrintable,
>>                 Printable, Serializable
>> { ... }
>>
>> (Besides the well known Serializable interface, the other implemented
>> interfaces will be discussed later.)
>>
>> All the PHP programs and the external libraries based on this class are
>> completely unaware of the actual internal encoding used. Several
>> implementations may be provided to fit different needs. In Western
>> European countries UTF-8 works well, and some optimizations allow
>> performance very close to that of an ordinary "string" of bytes. For
>> example, in my PHP implementation an internal byte index makes it possible
>> to scan the UTF-8 sequences quickly forward and backward.
>>
>> 2. The UString class has no constructor, but several factory methods that
>> take arrays of bytes (aka string):
>>
>>         static UString function fromASCII(string $s)
>>         static UString function fromUTF8(string $s)
>>         static UString function fromISO88591(string $s)
>>         static UString function fromUTF16LE(string $s)
>>         static UString function fromUTF16BE(string $s)
>>         ...
>>
>> These factory methods silently skip invalid bytes and invalid sequences,
>> possibly replacing them with '?'. No warnings, no exceptions. Other utility
>> functions may be provided that check an array of bytes for a specific
>> encoding.
>>
>> Several corresponding instance methods perform the reverse translation
>> into an array of bytes:
>>
>>         string function toASCII()
>>         string function toUTF8()
>>         string function toISO88591()
>>         string function toUTF16LE()
>>         string function toUTF16BE()
>>         ...
>>
>> So, for example,
>>
>>         UString::fromUTF8( $u->toUTF8() )->equals($u)
>>
>> is always TRUE for any Unicode string $u (the equals() method is described
>> below).
>>
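
A minimal sketch of the factory and conversion methods from item 2, assuming UTF-8 as the hidden internal representation. The class name UStringSketch, the regex-based validation and the choice to turn each stray byte into one '?' are illustrative assumptions; this is not the implementation linked in the original post.

```php
<?php
// Hypothetical sketch: class name and internals are illustrative assumptions.
final class UStringSketch
{
    // Matches one well-formed UTF-8 sequence (RFC 3629 ranges) or, failing
    // that, any single stray byte, which is captured in group 1.
    private const UTF8_OR_STRAY =
        '/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}|(.)/s';

    private string $utf8; // always well-formed UTF-8

    // No public constructor: instances come from the factory methods only.
    private function __construct(string $wellFormedUtf8)
    {
        $this->utf8 = $wellFormedUtf8;
    }

    public static function fromUTF8(string $s): self
    {
        // Replace every stray byte with '?'; no warnings, no exceptions.
        $clean = preg_replace_callback(
            self::UTF8_OR_STRAY,
            static fn(array $m): string => ($m[1] ?? '') !== '' ? '?' : $m[0],
            $s
        );
        return new self($clean ?? '');
    }

    public static function fromISO88591(string $s): self
    {
        // Each Latin-1 byte maps to the code point with the same number.
        $out = '';
        for ($i = 0, $n = strlen($s); $i < $n; $i++) {
            $b = ord($s[$i]);
            $out .= $b < 0x80 ? $s[$i] : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
        return new self($out);
    }

    public function toUTF8(): string
    {
        return $this->utf8;
    }

    public function length(): int
    {
        // Counts code points, not bytes.
        return (int) preg_match_all('/./su', $this->utf8);
    }

    public function equals($other): bool
    {
        return $other instanceof self && $other->utf8 === $this->utf8;
    }
}
```

Because the stored bytes are always well-formed, fromUTF8($u->toUTF8()) rebuilds an equal object, which is the round-trip property stated above.
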
>> 3. The UString class provides the usual string manipulation routines:
>>
>>         int function length()
>>         UString function substring($from, $to)
>>         UString function charAt($index)
>>         UString function append(UString $u)
>>         bool    function startsWith(UString $u)
>>         bool    function endsWith(UString $u)
>>         int     function indexOf(UString $u)
>>         UString function trim($blacklist = u("\n\r\t"))
>>         bool    function equalsIgnoreCase(UString $u)
>>         UString function toUpperCase()
>>         UString function toLowerCase()
>>         UString[] function explode($separator = u(" "))
>>         UString function implode($separator = u(""))
>>         ...
>>
>> (More about the magic u() function later).
>>
>> 4. The UString class implements the UPrintable interface that returns "the
>> best human-readable representation of the object as a UString string",
>> that is, the string itself:
>>
>>         UString function __toUString(){ return $this; }
>>
>> 5. The UString class implements the Printable interface that returns "the
>> best human-readable representation of the object as a string, possibly
>> composed of ASCII characters only":
>>
>>         string function __toString(){ return $this->toASCII(); }
>>
>> 6. The UString class implements the Hashable interface, useful to
>> implement hashing algorithms (HashMap, HashSet, ...):
>>
>>         int function getHash(){ ... }
>>
>> Since UString is immutable, this function may compute the hash once and
>> for all. In my current PHP implementation I have used crc32(), but the PHP
>> engine hides a more efficient hashing function that might be used instead
>> (what about making it available in userland code?).
>>
>> 7. The UString class implements the Comparable interface:
>>
>>         bool function equals(object $u)
>>
>> that returns TRUE only if the object is a UString and contains the same
>> sequence of Unicode characters, and returns FALSE in any other case.
>>
>> 8. The UString class implements the Sortable interface:
>>
>>         int function compareTo(object $u)
>>
>> that returns -1, 0 or +1 if $u is a UString, or raises E_WARNING if $u is
>> not a UString.
>>
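
Items 6 to 8 can be sketched together as follows. This is only an illustration under stated assumptions: the class name ComparableUString and the UTF-8 storage are hypothetical, E_USER_WARNING stands in for E_WARNING (userland code cannot raise the latter), and returning 0 after the warning is an arbitrary choice. The ordering relies on the fact that byte-wise comparison of well-formed UTF-8 gives the same order as comparison by code point.

```php
<?php
// Hypothetical sketch: the class name and the UTF-8 storage are assumptions.
final class ComparableUString
{
    private string $utf8;       // assumed to hold well-formed UTF-8
    private ?int $hash = null;  // computed lazily, at most once

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function getHash(): int
    {
        // Memoizing is safe because the object never changes.
        return $this->hash ??= crc32($this->utf8);
    }

    public function equals($other): bool
    {
        // FALSE for anything that is not the same class with the same text.
        return $other instanceof self && $other->utf8 === $this->utf8;
    }

    public function compareTo($other): int
    {
        if (!$other instanceof self) {
            trigger_error('compareTo() expects a ' . self::class, E_USER_WARNING);
            return 0; // arbitrary fallback after the warning
        }
        // Normalize the result of strcmp() to -1, 0 or +1.
        return strcmp($this->utf8, $other->utf8) <=> 0;
    }
}
```
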
>>
>>
>> The u() and the uecho() magic functions
>> =======================================
>>
>> Two "magic" functions help in writing PHP programs. Basically, u() is a
>> factory function that translates an array of bytes into a UString, for
>> example:
>>
>>         UString function u(string $s){
>>                 return UString::fromXxx($s);
>>         }
>>
>>         $hello = u("hello");
>>
>> where Xxx is the encoding of the source. But the u() function may do much
>> more than this, and the implementation I have made provides several other
>> features:
>>
>> - Literal strings are cached, so if the u("hello") statement is executed
>> several times, a single UString object is created once and for all and
>> this object is returned each time u("hello") is evaluated. Since the
>> number of literal strings in a source program is finite, the string cache
>> is finite as well.
>>
>> - Automatically converts any type of data into UString, so that u(123)
>> yields the same as UString::fromASCII( (string) 123 ). Here too, small
>> numbers, the most common ones, can be cached.
>>
>> - If the argument is an object that implements the UPrintable interface,
>> its __toUString() method is called; if it implements Printable, the
>> __toString() method is called. Boolean values generate "FALSE" or "TRUE".
>> The NULL value generates "NULL", etc.
>>
>> - If the argument is a UString, it is returned unchanged.
>>
>> - If several arguments are provided, each argument is converted into a
>> UString and the results are concatenated with UString::append().
>>
>> All this can be implemented in PHP source, and does not require changes to
>> the engine. The uecho() function does just the same, but also sends the
>> result to stdout using the chosen encoding:
>>
>>         uecho(...)  ====>   echo u(...)->toXxx();
>>
>> where Xxx is the encoding corresponding to the one used in the u()
>> function to translate literal strings.
>>
>> Programmers may then write something like this:
>>
>>         function TenHelloWorld() {
>>                 $hello = u("hello");
>>                 $world = u("world!");
>>                 for($i = 0; $i < 10; $i++)
>>                         uecho($i, " - ", $hello, ", ", $world);
>>         }
>>
>> This function generates 2 objects for the $hello and $world vars (cached),
>> 2 objects for the " - " and the ", " literal strings (cached), 10 objects
>> for the $i numbers (possibly cached), and another 10 objects for the
>> resulting concatenations of the strings. If this function is called again,
>> the cached values are reused.
>>
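
The behaviour described for u() and uecho() could look roughly like this in userland. It is a sketch under assumptions, not the code from utf8.php: TinyUString is a hypothetical stand-in for UString, the source encoding is taken to be UTF-8, every string argument is cached (userland cannot tell literals from dynamic strings), and the UPrintable/Printable dispatch is reduced to a plain string cast.

```php
<?php
// Hypothetical stand-in for UString: wraps bytes assumed to be valid UTF-8.
final class TinyUString
{
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function append(TinyUString $u): TinyUString
    {
        return new TinyUString($this->utf8 . $u->utf8);
    }

    public function toUTF8(): string
    {
        return $this->utf8;
    }
}

// Converts each argument to a TinyUString and concatenates the results.
function u(...$args): TinyUString
{
    static $cache = []; // one shared object per distinct string

    $result = null;
    foreach ($args as $arg) {
        if ($arg instanceof TinyUString) {
            $piece = $arg;                          // returned unchanged
        } elseif (is_string($arg)) {
            $piece = $cache[$arg] ??= new TinyUString($arg);
        } elseif (is_bool($arg)) {
            $piece = u($arg ? 'TRUE' : 'FALSE');
        } elseif ($arg === null) {
            $piece = u('NULL');
        } else {
            $piece = u((string) $arg);              // numbers, Stringable objects
        }
        $result = $result === null ? $piece : $result->append($piece);
    }
    return $result ?? u('');
}

// Same conversion as u(), then writes the bytes to standard output.
function uecho(...$args): void
{
    echo u(...$args)->toUTF8();
}
```

With this sketch, calling u("hello") twice returns the very same object, while each concatenation still allocates a new one, which matches the object counts given for TenHelloWorld().
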
>>
>> Support from the PHP engine
>> ===========================
>>
>> R1. First of all, the current implementation of the UString class, being
>> bare PHP code, isn't very efficient, and a C implementation of all or at
>> least some of the most critical sections of code could greatly improve
>> performance. Since the PHP code developed around the UString class does
>> not depend on the internal representation of the Unicode characters,
>> several implementations may be tested and the final decision about the
>> "standard" one can be postponed. Or the choice can be left to the users,
>> who may choose the implementation that best fits their needs.
>>
>> R2. Second, the u() function requires some support from the PHP engine
>> because it cannot be used in static expressions. The following code, for
>> example, is not valid:
>>
>> function f( UString $s = u("xxx") ){ ... }
>>
>> class MyClass {
>>         const
>>                 C = u("xxx"),
>>                 A = array( u("zero"), u("one"), u("two") );
>> }
>>
>> If the PHP restriction imposed on static expressions could be relaxed a
>> bit, at least for some magic functions, the code above would be possible.
>>
>> R3. Third, only the engine can establish whether a string that enters the
>> u() function is really a literal string and not a dynamically generated
>> string. For example, in my current PHP implementation I can only warn the
>> programmer in the documentation against doing things like
>>
>>         for($i = 0; $i < 10000; $i++)
>>                 uecho( "cycle no. $i" );
>>
>> that would pollute the cache of u() with thousands of useless dynamically
>> generated strings. Or, even better, the PHP engine itself might split the
>> string and rewrite it automatically in a cache-aware way as:
>>
>>         uecho("cycle no. ", $i);
>>
>> R4. Another area where some support from the PHP engine would be useful is
>> the detection of the encoding used in the source, so that the Xxx encoding
>> to be used in the u() and uecho() functions can be determined
>> automatically. In my current PHP implementation I stick with UTF-8, but a
>> more general approach may take advantage of the new
>> declare(encoding="Xxx") statement. For example, the engine might
>> instantiate a "translator" object to be used for the current source, and
>> this translator object might be made available to the program as a global
>> variable that translates from array of bytes to UString and vice versa:
>>
>> interface EncodingTranslator {
>>         # Encoder call-back:
>>         UString function encode(string $s);
>>         # Decoder call-back:
>>         string function decode(UString $u);
>> }
>>
>> # UTF-8 specific encoder/decoder functions pair:
>> class UTF8EncodingTranslator implements EncodingTranslator {
>>         static EncodingTranslator function getInstance(){...}
>>         UString function encode(string $s){ return UString::fromUTF8($s); }
>>         string function decode(UString $u){ return $u->toUTF8(); }
>> }
>>
>> # Other encoder/decoder functions pairs:
>> class ISO88591EncodingTranslator implements EncodingTranslator { ... }
>> class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
>> ...
>> class ASCIIEncodingTranslator implements EncodingTranslator { ... }
>>
>>
>> # Here the PHP engine creates the per-source file translator object,
>> # setting the $curr_encoding_translator variable:
>> if( the engine detected this src is UTF8 encoded ){
>>         $curr_encoding_translator = UTF8EncodingTranslator::getInstance();
>> } else if( the engine detected this src is ISO-8859-1 encoded ){
>>         $curr_encoding_translator = ISO88591EncodingTranslator::getInstance();
>> ...
>> } else {
>>         $curr_encoding_translator = ASCIIEncodingTranslator::getInstance();
>> }
>>
>>
>> The u() function may then use the global variable
>> $curr_encoding_translator to encode and decode every string in every
>> specific source program; that is, $curr_encoding_translator changes its
>> value according to the source that is currently being executed. In this
>> way libraries can be developed separately with different source encodings
>> without affecting interoperability with past and future programs.
>>
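
For reference, the translator idea from R4 can be written in PHP syntax that runs today. This is a sketch under assumptions: UText is a hypothetical stand-in for UString that stores UTF-8, the UTF-8 translator trusts its input to be well-formed, and translatorFor() is an invented helper that plays the role of the engine-side detection, which userland code cannot actually perform.

```php
<?php
// Hypothetical stand-in for UString; it stores text as UTF-8 bytes.
final class UText
{
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function toUTF8(): string
    {
        return $this->utf8;
    }
}

interface EncodingTranslator
{
    public function encode(string $bytes): UText; // source bytes -> Unicode
    public function decode(UText $u): string;     // Unicode -> source bytes
}

final class UTF8EncodingTranslator implements EncodingTranslator
{
    private static ?self $instance = null;

    public static function getInstance(): self
    {
        return self::$instance ??= new self();
    }

    // Assumption: the input is already well-formed UTF-8 (no validation here).
    public function encode(string $bytes): UText { return new UText($bytes); }
    public function decode(UText $u): string { return $u->toUTF8(); }
}

final class ISO88591EncodingTranslator implements EncodingTranslator
{
    private static ?self $instance = null;

    public static function getInstance(): self
    {
        return self::$instance ??= new self();
    }

    public function encode(string $bytes): UText
    {
        // Each Latin-1 byte maps to the code point with the same number.
        $utf8 = '';
        for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
            $b = ord($bytes[$i]);
            $utf8 .= $b < 0x80 ? $bytes[$i] : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
        return new UText($utf8);
    }

    public function decode(UText $u): string
    {
        // Code points above U+00FF have no Latin-1 byte and become '?'.
        return preg_replace_callback('/[^\x00-\x7F]/u', static function (array $m): string {
            $c = $m[0];
            if (strlen($c) === 2 && (ord($c[0]) === 0xC2 || ord($c[0]) === 0xC3)) {
                return chr(((ord($c[0]) & 0x03) << 6) | (ord($c[1]) & 0x3F));
            }
            return '?';
        }, $u->toUTF8()) ?? '';
    }
}

// Invented helper: picks a translator from a declared encoding name.
function translatorFor(string $declaredEncoding): EncodingTranslator
{
    switch (strtoupper($declaredEncoding)) {
        case 'ISO-8859-1':
            return ISO88591EncodingTranslator::getInstance();
        default:
            return UTF8EncodingTranslator::getInstance();
    }
}
```

A library written in ISO-8859-1 and a program written in UTF-8 would each get their own translator here, yet both would exchange the same UText values.
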
>>
>> Further developments
>> ====================
>>
>> - I/O functions that support Unicode file names through modern dedicated
>> classes/functions: FileOutputStream, FileInputStream, etc. This is
>> particularly required under Windows, whose file system uses the UCS-2
>> encoding (in brief: replace fopen() with _wfopen() etc. under Windows, or
>> provide another "hook" that exposes these functions to PHP sources so that
>> these classes can be implemented in PHP code).
>> - String pattern matching, aka regex, but fully Unicode aware.
>> - Encoding-independent database abstraction layer.
>> - A new generation of portable libraries.
>>
>>
>> Proof of concept - The actual PHP implementation
>> ================================================
>>
>> The PHP implementation of all this is available both as documentation and
>> as PHP source at this address:
>>
>> http://www.icosaedro.it/phplint/libraries.cgi
>>
>> Almost all the classes listed above are available, in particular:
>>
>>         UString
>>         UPattern (regex with UString)
>>         utf8.php (provides u() and uecho() for UTF-8 only)
>>         FileName (attempt to support Unicode file names on Linux and Win)
>>
>>
>>
>> Regards,
>>  ___
>> /_|_\  Umberto Salsi
>> \/_\/  www.icosaedro.it
>>
>>
>> --
>> PHP Unicode & I18N Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
>>
> Hi,
>
> I think that there aren't that many people subscribed to this list, so I'm
> ccing the internals list, as your suggestion is to implement/bundle this
> into the core.
>
>
For the record, there is another userland library targeting Unicode support
without external dependencies:
https://github.com/nicolas-grekas/Patchwork-UTF8
It is currently being considered for inclusion in Symfony2, so that they can
leverage the php extension dependencies.
See https://groups.google.com/forum/#!topic/symfony-devs/FtODyLi8OYk

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
