[PHP-DEV] Re: [PHP-I18N] Unicode support with UString abstraction layer

Ferenc Kovacs Sun, 22 Jul 2012 14:51:56 -0700

Hi,

I think that there aren't that many people subscribed to this list, so I'm
CCing the internals list, as your suggestion is to implement/bundle this in
the core.


On Wed, Mar 21, 2012 at 7:23 PM, Umberto Salsi <[email protected]> wrote:

> Although I never contributed to the code of the PHP project, I hope the
> ideas
> that follow may provide some suggestion for the future developments of the
> PHP
> language. The basic idea is described in the abstract, and those that are
> not
> interested may stop there :-)
>
>
> Abstract
> ========
>
> My modest proposal is to provide Unicode support to PHP without the need to
> rewrite the whole engine and its libraries, by introducing the UString
> abstraction layer as a regular class, with minimal support from the core
> engine. Basically, the UString class hides the internal implementation of
> Unicode strings and allows experimenting with several solutions (UTF-8,
> UCS-2, UCS-4, ...). Several different implementations may be attempted and
> may also be made available, leaving the user free to choose the best
> compromise between performance and memory footprint. Two useful functions,
> u() and uecho(), are also discussed; they may help in writing Unicode
> interoperable source programs and libraries, and they require some support
> from the PHP engine.
>
>
> The UString class
> =================
>
> 1. The UString class holds an immutable array of Unicode characters. It is
> final and hides the internal representation of the Unicode string, which
> may be UTF-8 or UCS-2 or anything else:
>
> final class UString
>         implements Hashable, Comparable, Sortable, UPrintable,
>                 Printable, Serializable
> { ... }
>
> (Besides the well known Serializable interface, the other implemented
> interfaces will be discussed later.)
>
> All the PHP programs and the external libraries based on this class are
> completely unaware of the actual internal encoding used. Several
> implementations may be provided to fit different needs. In Western European
> countries UTF-8 works well, and some optimizations allow performance that
> is very close to that of an ordinary "string" of bytes. For example, in my
> PHP implementation an internal byte index allows scanning the UTF-8
> sequences quickly forward and backward.
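
(A rough sketch of what stepping a byte index through UTF-8 could look like.
This is not taken from the linked implementation; the helper names utf8Next()
and utf8Prev() are made up for illustration, and the input is assumed to be
valid UTF-8 with the index already on a sequence boundary.)

```php
<?php
// Hypothetical helpers: move a byte offset to the start of the next or
// previous UTF-8 sequence. Assumes $s is valid UTF-8 and $i points at the
// start of a sequence (or at strlen($s)).

function utf8Next(string $s, int $i): int
{
    $n = strlen($s);
    if ($i >= $n) {
        return $n;                 // already at the end
    }
    $i++;
    // Continuation bytes have the bit pattern 10xxxxxx.
    while ($i < $n && (ord($s[$i]) & 0xC0) === 0x80) {
        $i++;
    }
    return $i;
}

function utf8Prev(string $s, int $i): int
{
    if ($i <= 0) {
        return 0;                  // already at the beginning
    }
    $i--;
    // Skip backward over continuation bytes until a lead byte is found.
    while ($i > 0 && (ord($s[$i]) & 0xC0) === 0x80) {
        $i--;
    }
    return $i;
}

// "a", U+00E9 (2 bytes), U+20AC (3 bytes), "z": sequences start at 0, 1, 3, 6.
$s = "a\u{E9}\u{20AC}z";
$starts = [];
for ($i = 0; $i < strlen($s); $i = utf8Next($s, $i)) {
    $starts[] = $i;
}
```

Because continuation bytes are self-identifying, neither direction needs to
decode the code point, which is why such a scan can stay close to plain byte
indexing.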
>
> 2. The UString class has no constructor, but several factory methods that
> take
> arrays of bytes (aka string):
>
>         static UString function fromASCII(string $s)
>         static UString function fromUTF8(string $s)
>         static UString function fromISO88591(string $s)
>         static UString function fromUTF16LE(string $s)
>         static UString function fromUTF16BE(string $s)
>         ...
>
> These factory methods silently skip invalid bytes and invalid sequences,
> possibly replacing them with '?'. No warning, no exceptions. Other utility
> functions may be provided that check an array of bytes for a specific
> encoding.
>
> Several corresponding instance methods perform the reverse translation into
> an array of bytes:
>
>         string function toASCII()
>         string function toUTF8()
>         string function toISO88591()
>         string function toUTF16LE()
>         string function toUTF16BE()
>         ...
>
> So, for example,
>
>         UString::fromUTF8( $u->toUTF8() )->equals($u)
>
> is always TRUE for any Unicode string $u (the equals() method is described
> below).
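
(To make the round-trip property concrete, here is a minimal sketch of a
UTF-8-backed stand-in for UString. Storing UTF-8 internally, the regular
expression used to drop invalid bytes, and the use of typed properties are my
assumptions for the example; the proposal deliberately leaves the internal
representation open.)

```php
<?php
// Sketch only: a UTF-8-backed stand-in for the proposed UString class.

final class UString
{
    // One well-formed UTF-8 sequence (no overlongs, no surrogates,
    // nothing above U+10FFFF).
    private const UTF8_CHAR =
        '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    private string $bytes;   // always valid UTF-8 in this sketch

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public static function fromUTF8(string $s): UString
    {
        // Keep well-formed sequences; replace every other byte with '?'.
        // No warning and no exception, as the proposal describes.
        $clean = preg_replace_callback(
            '/(' . self::UTF8_CHAR . ')|./s',
            function (array $m): string {
                return ($m[1] ?? '') !== '' ? $m[1] : '?';
            },
            $s
        );
        return new UString((string) $clean);
    }

    public function toUTF8(): string
    {
        return $this->bytes;
    }

    public function equals($other): bool
    {
        return $other instanceof UString && $other->bytes === $this->bytes;
    }
}

// The stray 0xFF byte is not valid UTF-8 and is replaced on the way in.
$u = UString::fromUTF8("caf\u{E9} \xFF ok");
$roundTrip = UString::fromUTF8($u->toUTF8())->equals($u);
```

The round trip holds here because toUTF8() can only ever emit what fromUTF8()
already accepted; the lossy step happens once, at construction.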
>
> 3. The UString class provides the usual string manipulation routines:
>
>         int function length()
>         UString function substring($from, $to)
>         UString function charAt($index)
>         UString function append(UString $u)
>         bool    function startsWith(UString $u)
>         bool    function endsWith(UString $u)
>         int     function indexOf(UString $u)
>         UString function trim($blacklist = u("\n\r\t"))
>         bool    function equalsIgnoreCase(UString $u)
>         UString function toUpperCase()
>         UString function toLowerCase()
>         UString[] function explode($separator = u(" "))
>         UString function implode($separator = u(""))
>         ...
>
> (More about the magic u() function later).
>
> 4. The UString class implements the UPrintable interface that returns "the
> best human-readable representation of the object as a UString string", that
> is, the string itself:
>
>         UString function __toUString(){ return $this; }
>
> 5. The UString class implements the Printable interface that returns "the
> best
> human-readable representation of the object as a string, possibly composed
> of
> ASCII characters only":
>
>         string function __toString(){ return $this->toASCII(); }
>
> 6. The UString class implements the Hashable interface, useful to implement
> hashing algorithms (HashMap, HashSet, ...):
>
>         int function getHash(){ ... }
>
> Since UString is immutable, this function may compute the hash once for
> all. In
> my current PHP implementation I have used crc32(), but the PHP engine
> hides a
> more efficient hashing function that might be used instead (what about
> making
> it available in userland code?).
>
> 7. The UString class implements the Comparable interface:
>
>         bool function equals(object $u)
>
> that returns TRUE only if the object is UString and contains the same
> sequence
> of Unicode characters and returns FALSE in any other case.
>
> 8. The UString class implements the Sortable interface:
>
>         int function compareTo(object $u)
>
> that returns -1, 0 or +1 if $u is UString, or raises E_WARNING if $u is not
> UString.
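
(A small sketch of how points 6 to 8 might fit together on a UTF-8-backed
stub. The interface declarations are left out so the block runs on its own.
Userland code can only raise E_USER_WARNING rather than E_WARNING, and the
return value of 0 after the warning is my assumption, since the proposal does
not specify one.)

```php
<?php
// Sketch only: getHash(), equals() and compareTo() on a UTF-8-backed stub.

final class UString
{
    private string $bytes;
    private ?int $hash = null;      // computed lazily, then reused

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // No validation in this stub; the input is assumed to be valid UTF-8.
    public static function fromUTF8(string $s): UString
    {
        return new UString($s);
    }

    public function getHash(): int
    {
        // The object is immutable, so the hash never needs recomputing.
        if ($this->hash === null) {
            $this->hash = crc32($this->bytes);
        }
        return $this->hash;
    }

    public function equals($other): bool
    {
        return $other instanceof UString && $other->bytes === $this->bytes;
    }

    public function compareTo($other): int
    {
        if (!($other instanceof UString)) {
            trigger_error('compareTo() expects a UString', E_USER_WARNING);
            return 0;               // assumption: the proposal leaves this open
        }
        // Byte-wise comparison of valid UTF-8 orders strings by code point.
        // strcmp() may return any negative or positive int, so normalize it.
        return strcmp($this->bytes, $other->bytes) <=> 0;
    }
}

$a = UString::fromUTF8("apple");
$b = UString::fromUTF8("banana");
```

Comparing by code point is only one possible ordering; a locale-aware
collation would need something beyond this sketch.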
>
>
>
> The u() and the uecho() magic functions
> =======================================
>
> Two "magic" functions help in writing PHP programs. Basically, u() is a
> factory function that translates an array of bytes into a UString, for
> example:
>
>         UString function u(string $s){
>                 return UString::fromXxx($s);
>         }
>
>         $hello = u("hello");
>
> where Xxx is the encoding of the source. But the u() function may do much
> more
> than this, and the implementation I have made provides several other
> features:
>
> - Literal strings are cached, so if the u("hello") statement is executed
> several times, a single UString object is created once and for all, and
> this object is returned each time the u("hello") function is evaluated.
> Since the number of literal strings in a source program is finite, the
> string cache is finite as well.
>
> - Automatically converts any type of data into UString, so that u(123)
> yields
> the same as UString::fromASCII( (string) 123 ). Here too, small numbers,
> the
> most common ones, can be cached.
>
> - If the argument is an object that implements the UPrintable interface,
> its
> __toUString() method is called; if it implements Printable, the
> __toString()
> method is called. Boolean values generate "FALSE" or "TRUE". NULL value
> generates "NULL". etc.
>
> - If the argument is UString, gives itself.
>
> - If several arguments are provided, each argument is converted into
> UString
> and concatenated with UString::append().
>
> All this can be implemented in PHP source, and does not require changes to
> the
> engine. The uecho() function does just the same, but also sends the result
> to
> stdout using the chosen encoding:
>
>         uecho(...)  ====>   echo u(...)->toXxx();
>
> where Xxx is the encoding corresponding to the one used in the u()
> function to translate literal strings.
>
> Programmers may then write something like this:
>
>         function TenHelloWorld() {
>                 $hello = u("hello");
>                 $world = u("world!");
>                 for($i = 0; $i < 10; $i++)
>                         uecho($i, " - ", $hello, ", ", $world);
>         }
>
> This function generates 2 objects for the $hello and $world vars (cached),
> 2
> objects for the " - " and the ", " literal strings (cached), 10 objects
> for the
> $i numbers (possibly cached), and other 10 objects for the resulting
> concatenation of the strings. If this function is called again, cached
> values
> are reused again.
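
(For readers who want to see the caching and conversion rules in one place,
here is a minimal userland sketch of u() and uecho(). The UString stub, the
fixed UTF-8 source encoding, and caching every converted scalar are my
assumptions; the linked implementation may differ.)

```php
<?php
// Sketch only: a minimal UString stub plus u() and uecho().

final class UString
{
    private string $bytes;

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // No validation in this stub; the input is assumed to be valid UTF-8.
    public static function fromUTF8(string $s): UString
    {
        return new UString($s);
    }

    public function toUTF8(): string
    {
        return $this->bytes;
    }

    public function append(UString $u): UString
    {
        return new UString($this->bytes . $u->bytes);
    }
}

function u(...$args): UString
{
    static $cache = [];              // converted text => UString
    $result = null;
    foreach ($args as $arg) {
        if ($arg instanceof UString) {
            $piece = $arg;           // a UString gives itself
        } else {
            if ($arg === null) {
                $text = 'NULL';
            } elseif (is_bool($arg)) {
                $text = $arg ? 'TRUE' : 'FALSE';
            } else {
                // Strings, numbers, and objects with __toString().
                $text = (string) $arg;
            }
            // This stub cannot tell literals from dynamic strings, so it
            // simply caches every converted value.
            if (!isset($cache[$text])) {
                $cache[$text] = UString::fromUTF8($text);
            }
            $piece = $cache[$text];
        }
        $result = $result === null ? $piece : $result->append($piece);
    }
    return $result ?? UString::fromUTF8('');
}

function uecho(...$args): void
{
    echo u(...$args)->toUTF8();
}
```

With this stub, calling u("hello") twice returns the very same object, while
a concatenation such as u(1, " - ", "hello") always builds a new one.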
>
>
> Support from the PHP engine
> ===========================
>
> R1. First of all, the current implementation of the UString class, being
> bare PHP code, isn't very efficient, and a C implementation of all or at
> least some of the most critical sections of code could greatly improve
> performance. Since the PHP code developed around the UString class does not
> depend on the internal representation of the Unicode characters, several
> implementations may be tested and the final decision about the "standard"
> one can be postponed. Or the choice can be left to the users, who may
> choose the implementation that best fits their needs.
>
> R2. Second, the u() function requires some support from the PHP engine
> because
> it cannot be used in static expressions. The following code, for example,
> is
> not valid:
>
> function f( UString $s = u("xxx") ){ ... }
>
> class MyClass {
>         const
>                 C = u("xxx"),
>                 A = array( u("zero"), u("one"), u("two") );
> }
>
> If the PHP restriction imposed on static expressions could be relaxed a
> bit, at least for some magic functions, the code above would be possible.
>
> R3. Third, only the engine can establish whether a string that enters the
> u() function is really a literal string and not a dynamically generated
> string. For example, in my current PHP implementation I can only warn the
> programmer in the documentation against doing things like
>
>         for($i = 0; $i < 10000; $i++)
>                 uecho( "cycle no. $i" );
>
> that would pollute the cache of u() with thousands of useless dynamically
> generated strings. Or, even better, the PHP engine itself might split the
> string and rewrite it automatically in a cache-aware way as:
>
>         uecho("cycle no. ", $i);
>
> R4. Another area where some support from the PHP engine would be useful is
> the detection of the encoding used in the source, so that the Xxx encoding
> to be used in the u() and uecho() functions can be automatically
> determined. In my current PHP implementation I stick with UTF-8, but a more
> general approach may take advantage of the new declare(encoding="Xxx")
> statement. For example, the engine might instantiate a "translator" object
> to be used for the current source, and this translator object might be made
> available to the program as a global variable that translates from array of
> bytes to UString and vice versa:
>
> interface EncodingTranslator {
>         # Encoder call-back:
>         UString function encode(string $s);
>         # Decoder call-back:
>         string function decode(UString $u);
> }
>
> # UTF-8 specific encoder/decoder functions pair:
> class UTF8EncodingTranslator implements EncodingTranslator {
>         static EncodingTranslator function getInstance(){...}
>         UString function encode(string $s){ return UString::fromUTF8($s); }
>         string function decode(UString $u){ return $u->toUTF8(); }
> }
>
> # Other encoder/decoder functions pairs:
> class ISO88591EncodingTranslator implements EncodingTranslator { ... }
> class UTF16LEEncodingTranslator implements EncodingTranslator { ... }
> ...
> class ASCIIEncodingTranslator implements EncodingTranslator { ... }
>
>
> # Here the PHP engine creates the per-source file translator object,
> # setting the $curr_encoding_translator variable:
> if( the engine detected this src is UTF8 encoded ){
>         $curr_encoding_translator = UTF8EncodingTranslator::getInstance();
> } else if( the engine detected this src is ISO-8859-1 encoded ){
>         $curr_encoding_translator = ISO88591EncodingTranslator::getInstance();
> ...
> } else {
>         $curr_encoding_translator = ASCIIEncodingTranslator::getInstance();
> }
>
>
> The u() function may then use the global variable $curr_encoding_translator
> to encode and decode every string in every specific source program; that
> is, $curr_encoding_translator changes its value according to the source
> which is currently under execution. In this way libraries can be developed
> separately with different source encodings without affecting the
> interoperability with past and future programs.
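
(The translator idea can be tried out in userland today. Below is a sketch in
current PHP syntax with a UTF-8-backed UString stub. The translatorFor()
function is a hypothetical stand-in for the engine's per-file detection, and
the ISO-8859-1 conversion is done by hand so that no extension is needed.)

```php
<?php
// Sketch only: the proposal's EncodingTranslator in current PHP syntax.

final class UString
{
    private string $bytes;

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // No validation in this stub; the input is assumed to be valid UTF-8.
    public static function fromUTF8(string $s): UString
    {
        return new UString($s);
    }

    public function toUTF8(): string
    {
        return $this->bytes;
    }
}

interface EncodingTranslator
{
    public function encode(string $s): UString;   // source bytes -> UString
    public function decode(UString $u): string;   // UString -> source bytes
}

final class UTF8EncodingTranslator implements EncodingTranslator
{
    public function encode(string $s): UString
    {
        return UString::fromUTF8($s);
    }

    public function decode(UString $u): string
    {
        return $u->toUTF8();
    }
}

final class ISO88591EncodingTranslator implements EncodingTranslator
{
    public function encode(string $s): UString
    {
        // Each byte 0x80..0xFF becomes a two-byte UTF-8 sequence.
        $utf8 = preg_replace_callback('/[\x80-\xFF]/', function (array $m): string {
            $b = ord($m[0]);
            return chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }, $s);
        return UString::fromUTF8((string) $utf8);
    }

    public function decode(UString $u): string
    {
        // U+0080..U+00FF map back to one byte; anything above becomes '?'.
        return (string) preg_replace_callback(
            '/[\xC2\xC3][\x80-\xBF]|[\xC4-\xF4][\x80-\xBF]+/',
            function (array $m): string {
                $lead = ord($m[0][0]);
                if ($lead <= 0xC3) {
                    return chr((($lead & 0x1F) << 6) | (ord($m[0][1]) & 0x3F));
                }
                return '?';
            },
            $u->toUTF8()
        );
    }
}

// Hypothetical stand-in for the engine's detection of the source encoding.
function translatorFor(string $declared): EncodingTranslator
{
    switch (strtoupper($declared)) {
        case 'UTF-8':
            return new UTF8EncodingTranslator();
        case 'ISO-8859-1':
            return new ISO88591EncodingTranslator();
        default:
            throw new InvalidArgumentException("No translator for $declared");
    }
}

$latin1 = translatorFor('ISO-8859-1');
$cafe = $latin1->encode("caf\xE9");    // Latin-1 bytes for "café"
```

A u() built on top of this would call encode() on the translator selected for
the file that contains the literal, and uecho() would call decode() on the
way out.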
>
>
> Further developments
> ====================
>
> - I/O functions that support Unicode file names through modern dedicated
> classes/functions: FileOutputStream, FileInputStream, etc. This is
> particularly
> required under Windows whose file system uses the UCS-2 encoding (in brief:
> replace fopen() with _wfopen() etc. under Windows, or provide another
> "hook"
> that exposes these functions to PHP sources so that these classes can be
> implemented in PHP code).
> - String pattern matching, aka regex, but fully Unicode aware.
> - Encoding-independent database abstraction layer.
> - A new generation of portable libraries.
>
>
> Proof of the concept - The actual PHP implementation
> ====================================================
>
> The PHP implementation of all this is available both as documentation and
> as PHP source at this address:
>
> http://www.icosaedro.it/phplint/libraries.cgi
>
> Almost all the classes listed above are available, in particular:
>
>         UString
>         UPattern (regex with UString)
>         utf8.php (provides u() and uecho() for UTF-8 only)
>         FileName (attempt to support Unicode file names on Linux and Win)
>
>
>
> Regards,
>  ___
> /_|_\  Umberto Salsi
> \/_\/  www.icosaedro.it
>
>
> --
> PHP Unicode & I18N Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>


-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
