[PHP-I18N] RE: [PHP-DEV] Unicode and MBCS support using ICU/xIUA

Carl W. Brown Sat, 13 Oct 2001 11:39:42 -0700

Andi,

Things have changed a lot in the past year or so.  Part of my effort has
also been to contribute code to ICU so that it will work well with xIUA.
With ICU release 1.8.1 http://oss.software.ibm.com/icu/ I now feel
comfortable using this synthesis.  In the meantime I have extended the
capabilities of what was once XICU into something that I am comfortable
releasing as open source code: xIUA 3.1 http://www.xnetinc.com/xiua/.


Where we got bogged down before was with the work to add 16 bit Unicode as a
data type to PHP.  This was only UCS-2 support.  The difference is that now
with Unicode 3.1 you need access to the additional 16 Unicode planes.  With
UTF-16 you use a pair of surrogate code units to represent characters
from U+10000 to U+10FFFF.  This makes it not only a 16 bit character set
but another MBCS character set.  I do not think that this is the right
approach for PHP.  When I last looked PHP stored everything, including
numbers, as strings.  xIUA allows you to use strings in much the same way.
If I invoke xiua_strlen on data that is UTF-32 it will invoke xiu4_strlen and
will return the number of UTF-32 characters in the string times 4.
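
To illustrate the surrogate mechanism and the "characters times 4" length
mentioned above, here is a minimal sketch in plain C.  It is not ICU or xIUA
code; the names utf16_encode and utf32_strlen_bytes are made up for this
example, and the byte-length interpretation of the UTF-32 case is an
assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU or xIUA): encode one code point as
   UTF-16.  Returns the number of 16 bit units written (1 or 2), or 0 if
   cp is not a valid Unicode scalar value. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp > 0x10FFFF) return 0;                /* beyond plane 16 */
    if (cp < 0x10000) {                         /* plane 0: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: low 10 bits */
    return 2;
}

/* Hypothetical helper: length in bytes of a null-terminated UTF-32 string,
   i.e. the number of characters times 4. */
static size_t utf32_strlen_bytes(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n * sizeof(uint32_t);
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, while U+0041 stays a
single 16 bit unit.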

Some operations don't work well.  strncpy has real problems in i18n
applications.  If I am copying MBCS or UTF-8 data it may copy a partial
character, and if I terminate the string with a null at the end of the buffer
I may end up with a string that ends in a broken character.  xiua_strcpuEx
solves this problem.  It will copy as many whole characters as will fit and
always adds a null to the end of the string.  This means that it will also
work with UTF-16 and UTF-32 with 16 and 32 bit null characters.
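
The whole-character copy described above can be sketched for the UTF-8 case
as follows.  This is only an illustration under stated assumptions, not the
xIUA implementation: utf8_seq_len and utf8_copy_whole are hypothetical
names, the input is assumed to be null-terminated UTF-8, and the UTF-16,
UTF-32 and code page cases are not covered.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
   with lead byte b.  Unexpected lead bytes are treated as 1-byte units. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: copy as many whole UTF-8 characters from src as fit
   in dst (dstsize bytes including the terminator) and always terminate
   dst.  Returns the number of bytes copied, not counting the null. */
static size_t utf8_copy_whole(char *dst, size_t dstsize, const char *src)
{
    size_t used = 0;

    if (dstsize == 0) return 0;              /* no room even for the null */
    while (src[used] != '\0') {
        size_t n = utf8_seq_len((unsigned char)src[used]);
        size_t i;

        for (i = 1; i < n; i++)              /* is the sequence complete? */
            if (src[used + i] == '\0') break;
        if (i < n) break;                    /* source ends mid-character */
        if (used + n > dstsize - 1) break;   /* next character will not fit */
        memcpy(dst + used, src + used, n);
        used += n;
    }
    dst[used] = '\0';
    return used;
}
```

With a 3-byte buffer and the UTF-8 string "aéb", strncpy followed by a
manual terminator would keep 'a' plus the first byte of 'é'; the sketch
keeps only 'a'.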

While I was doing this I also added support for CP932, CP936, CP949, CP950,
Shift JIS, KSC_5601, Big 5, GB2312, GB18030, EUC_JP, EUC_KR, EUC_TW, EUC_CN.

There are routines to validate a character set, give a count of characters,
move forward and back one character, etc.
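
For UTF-8, routines of this kind might look roughly like the sketch below.
The names are hypothetical and are not the xIUA functions; the validation
shown is a structural check only (it does not reject overlong forms or
encoded surrogates), and the stepping and counting helpers assume
well-formed input.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical: step forward to the start of the next character. */
static const char *utf8_next(const char *p)
{
    if (*p == '\0') return p;                /* already at the end */
    p++;
    while (utf8_is_cont((unsigned char)*p)) p++;
    return p;
}

/* Hypothetical: step back one character, never before start. */
static const char *utf8_prev(const char *start, const char *p)
{
    if (p == start) return p;
    p--;
    while (p > start && utf8_is_cont((unsigned char)*p)) p--;
    return p;
}

/* Hypothetical: count characters by counting non-continuation bytes. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (!utf8_is_cont((unsigned char)*s)) n++;
    return n;
}

/* Hypothetical: structural check that every lead byte is followed by the
   right number of continuation bytes.  Returns 1 if so, 0 otherwise. */
static int utf8_check_structure(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        int extra;
        if (*p < 0x80) extra = 0;
        else if ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else return 0;                       /* stray continuation or bad lead */
        p++;
        while (extra-- > 0) {
            if (!utf8_is_cont(*p)) return 0; /* also catches a premature null */
            p++;
        }
    }
    return 1;
}
```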

Carl

> -----Original Message-----
> From: Andi Gutmans [mailto:[EMAIL PROTECTED]]
> Sent: Friday, October 12, 2001 6:39 PM
> To: Carl W. Brown; [EMAIL PROTECTED]
> Subject: Re: [PHP-DEV] Unicode and MBCS support using ICU/xIUA
>
>
> I would love to finally see good i18n support in PHP.
> We set up a PHP i18n mailing list a while ago but nothing seems to have
> happened with it. How about we move the discussion there and resurrect the
> issue? (the mailing list is [EMAIL PROTECTED]; to subscribe:
> [EMAIL PROTECTED]).
>
> Andi
>
> At 04:31 PM 10/12/2001 -0700, Carl W. Brown wrote:
> >About a year and a half ago I developed code to enable PHP3 to support
> >Unicode using ICU.  There was some initial interest but it died fast so I
> >did not look into what it would take to develop a PHP4 solution.  The
> >original problem was adding UTF-16 (UCS-2) support to PHP. In the meantime
> >I have perfected the ICU interface code and have recently made it
> >available as open source code.
> >
> >It is thread safe and cross-platform, and not only supports all forms of
> >Unicode (UTF-8, UTF-16 & UTF-32) but it supports code page data with the
> >same code.  It is great for browsers because it can use the same functions
> >to process code page and Unicode data dynamically so that the code does
> >not change if you are using UTF-8 for one browser and EUC-JP for another.
> >It has new functions to make PHP charset handling easier.  No, there is no
> >need to add 16 bit data types to PHP.
> >
> >I added the Unicode support to PHP3 by making it a semi-resident module.
> >It also required some minor changes to the HTTP header processing.  With
> >the new xIUA code the changes to PHP are far fewer.  It also makes PHP
> >thread safe in that each thread can have a different locale.  This is
> >something that setlocale does not provide.
> >
> >PHP internal data types are internally char *.  This would be how I would
> >imagine implementing Unicode in PHP. The internal data is stored as char *
> >strings even if it is UTF-16 or UTF-32 data. xIUA will sort it out
> >automatically. If the data in a string is converted then the data type
> >attribute changes to match the internal format but the string is still
> >char *.
> >This would involve a minimal set of changes and would be backwards
> >compatible. If for example I collate a UTF-8 string with a UTF-16 string
> >the UTF-8 string is quickly transformed to a UTF-16 string in a stack
> >storage mechanism and it is collated. If the UTF-16 string is concatenated
> >to UTF-8 data it is converted to UTF-8. The user does not care; it is just
> >data. If I send an EUC-JP script page to a Shift_JIS browser it gets
> >converted.
> >
> >xIUA contains support for most of the popular browser code pages so it can
> >be used for code page support as well.  It also contains functions that
> >will convert strftime date time formats to ICU formats using the ICU
> >internal data files to produce locale correct formats.  It has routines to
> >parse accept-language strings or find the best character set for a
> >specific locale based on accept-charset strings.  It has Apache mods that
> >allow you to organize PHP scripts that have been localized into different
> >subdirectories by locale and it will optionally override the mod_mime
> >processing.  It has code that will force Apache to call the PHP per
> >directory processing so that settings can vary by directory.
> >
> >I am not sure how much of this has changed in PHP4 but the code is there
> >if you need it.
> >
> >Carl
> >
> >--
> >PHP Development Mailing List <http://www.php.net/>
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >To contact the list administrators, e-mail: [EMAIL PROTECTED]
>


-- 
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]
