Re: [PHP-I18N] Unicode Transliteration & ICU

Darren Cook Mon, 09 Jun 2008 16:05:46 -0700

> Thanks for your very informative reply, Darren. I guess that maybe
> PHP6 has implemented this from ICU. I was told by a PECL developer
> that there is something in PHP6 but he didn't elaborate.


The intl extension:
  http://pecl.php.net/package/intl/
You can use it from PHP 5.2.4 onwards (or 5.2.3 with some
modifications). Also see php|a magazine, Mar 2008.

> The one I am using at the moment is: 
> http://derickrethans.nl/translit.php

Thanks, I'd not heard of that. The Chinese conversion seems to be done
by a huge lookup table, which is interesting.

> Your work sounds interesting. I have downloaded your library, but am 
> having trouble navigating through it.

Yes, fclib is quite informal :-).

> What files should I be looking at for the transliteration?

utf8.inc, e.g. fclib_katakana_to_hepburn_romaji().
See also my articles in php|a, Aug and Sep 2007.

> I would like to be able to transliterate absolutely everything in 
> unicode. I have no idea if that is unreasonable as I am just getting 
> into character sets. I want them to make a bulletproof string to url 
> function for search engine friendliness and I also believe it is not 
> really a good thing to have high unicode in the url. For example
> 
> Héllo Thìs is a URL Ælfred => hello-this-is-a-url-aelfred

If URLs are the only concern I think I'd do this using urlencode(). What
does a transliteration approach gain you?
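
To make the comparison concrete, here is a minimal sketch of both
approaches on the example title. sketch_slug() and its three-entry
lookup table are hypothetical, made up for illustration only; a real
transliterator would need a far larger table (or ICU).

```php
<?php
// "Héllo Thìs is a URL Ælfred", written as UTF-8 byte escapes so the
// example does not depend on the encoding of this source file.
$title = "H\xC3\xA9llo Th\xC3\xACs is a URL \xC3\x86lfred";

// Approach 1: percent-encode the UTF-8 bytes. Nothing is lost, but the
// result is not very readable.
$encoded = urlencode($title);

// Approach 2: transliterate, then slugify (hypothetical helper).
function sketch_slug($text)
{
    // Tiny hand-written table covering only the characters in the example.
    $map = array(
        "\xC3\xA9" => 'e',   // é
        "\xC3\xAC" => 'i',   // ì
        "\xC3\x86" => 'Ae',  // Æ
    );
    $ascii = strtolower(strtr($text, $map));
    // Collapse every run of non-alphanumeric bytes into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($slug, '-');
}

echo $encoded, "\n";            // H%C3%A9llo+Th%C3%ACs+is+a+URL+%C3%86lfred
echo sketch_slug($title), "\n"; // hello-this-is-a-url-aelfred
```

The urlencode() form is reversible; the slug is readable but lossy, and
any character missing from the table simply becomes a hyphen.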

> Another thing that I started working on was a strtoupper, strtolower
> and ucfirst function for cyrillic and anything else that can be upper
> and lower case. However, being new to character set and unicode I am
> having trouble converting the hex codes to actual character and
> cannot get preg_replace to work with high unicode.

See fclib_utf8_chr() and uniord() in utf8.inc, which are UTF-8 versions
of PHP's chr() and ord() functions.
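
For readers without the library to hand, this is roughly what such a
pair of functions has to do. It is a sketch only, not the fclib code:
the names sketch_utf8_chr() and sketch_utf8_ord() are made up here, and
there is no validation of malformed input.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function sketch_utf8_chr($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode the first UTF-8 character of $s to its
// code point. Assumes $s is non-empty, well-formed UTF-8.
function sketch_utf8_ord($s)
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {
        return (($b0 & 0x1F) << 6) | (ord($s[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) {
        return (($b0 & 0x0F) << 12)
             | ((ord($s[1]) & 0x3F) << 6)
             | (ord($s[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18)
         | ((ord($s[1]) & 0x3F) << 12)
         | ((ord($s[2]) & 0x3F) << 6)
         | (ord($s[3]) & 0x3F);
}

// Cyrillic small BE (U+0431) to capital BE (U+0411): in the basic
// Cyrillic block the two cases are 0x20 apart.
$lower = "\xD0\xB1";
$upper = sketch_utf8_chr(sketch_utf8_ord($lower) - 0x20);
```

The fixed 0x20 offset only holds for U+0430 to U+044F; a general
strtoupper needs a proper case-mapping table.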

I'm not sure about using preg, as I don't think I've done it that way. The
manual http://jp2.php.net/manual/en/regexp.reference.php has a section
on Unicode: with the /u pattern modifier, PCRE accepts a full code point
written as \x{hhhh}. E.g.
  \x{0628}
to match U+0628 (Arabic BEH). Using \x twice in a row (\x06\x28) would
match the two bytes 0x06 and 0x28 instead.
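
A quick way to check how preg treats these escapes, as a sketch that
assumes the subject string is UTF-8 and PCRE has UTF-8 support. With
the /u modifier, \x{0628} names the code point itself, whereas \x06\x28
without /u looks for two separate bytes. The variable names are
illustrative only.

```php
<?php
// U+0628 ARABIC LETTER BEH is the two bytes D8 A8 in UTF-8.
$beh = "\xD8\xA8";

// Code-point escape: requires the /u (UTF-8) pattern modifier.
$codePointMatch = preg_match('/\x{0628}/u', $beh);

// Two one-byte escapes: matches the bytes 0x06 0x28, not U+0628.
$twoByteMatch = preg_match('/\x06\x28/', $beh);

// The same escape works in preg_replace().
$replaced = preg_replace('/\x{0628}/u', 'b', 'a' . $beh . 'c');

var_dump($codePointMatch); // int(1)
var_dump($twoByteMatch);   // int(0)
var_dump($replaced);       // string(3) "abc"
```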

Darren


-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)


