> Thanks for your very informative reply, Darren. I guess that maybe
> PHP6 has implemented this from ICU. I was told by a PECL developer
> that there is something in PHP6 but he didn't elaborate.

The intl extension: http://pecl.php.net/package/intl/
You can use it from PHP 5.2.4 onwards (or 5.2.3 with some modifications).
Also see php|a magazine, Mar 2008.

> The one I am using at the moment is:
> http://derickrethans.nl/translit.php

Thanks, I'd not heard of that. The Chinese conversion seems to be done by
a huge lookup table, which is interesting.

> Your work sounds interesting. I have downloaded your library, but am
> having trouble navigating through it.

Yes, fclib is quite informal :-).

> What files should I be looking at for the transliteration?

utf8.inc, e.g. fclib_katakana_to_hepburn_romaji(). See also my articles
in php|a, Aug and Sep 2007.

> I would like to be able to transliterate absolutely everything in
> unicode. I have no idea if that is unreasonable as I am just getting
> into character sets. I want them to make a bulletproof string to url
> function for search engine friendliness and I also believe it is not
> really a good thing to have high unicode in the url. For example
>
> Héllo Thìs is a URL Ælfred => hello-this-is-a-url-aelfred

If URLs are the only concern I think I'd do this using urlencode(). What
does a transliteration approach gain you?

> Another thing that I started working on was a strtoupper, strtolower
> and ucfirst function for cyrillic and anything else that can be upper
> and lower case. However, being new to character set and unicode I am
> having trouble converting the hex codes to actual character and
> cannot get preg_replace to work with high unicode.

See fclib_utf8_chr() and uniord() in utf8.inc, which are UTF-8 versions
of PHP's chr() and ord() functions.

I'm not sure about preg, as I don't think I've done it that way. The
manual http://jp2.php.net/manual/en/regexp.reference.php has a section on
unicode: with the /u (UTF-8) pattern modifier you can give the code point
in braces, e.g. \x{0628} to match U+0628 (Arabic BEH). Using \x twice in
a row (\x06\x28) would not work, as that matches the two bytes 0x06 and
0x28 rather than the character, which is encoded in UTF-8 as 0xD8 0xA8.

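On the preg question, here is a minimal sketch of how matching by code point appears to work with PCRE's /u modifier. The variable names are just for illustration, and it assumes your PHP was built with a PCRE that has UTF-8 support; treat it as a starting point rather than tested library code.

```php
<?php
// UTF-8 bytes for U+0628 (ARABIC LETTER BEH), written out explicitly
// so this file's own encoding does not matter.
$beh  = "\xD8\xA8";
$text = "abc" . $beh . "def";

// With the /u modifier the subject and pattern are treated as UTF-8,
// and \x{0628} refers to the code point U+0628.
// Single quotes keep PHP from interpreting the backslash itself.
$byCodePoint = preg_match('/\x{0628}/u', $text);

// Two \x escapes in a row mean the two bytes 0x06 and 0x28,
// which do not occur in $text, so this does not match.
$byTwoBytes = preg_match('/\x06\x28/', $text);

// The same escape works in preg_replace().
$replaced = preg_replace('/\x{0628}/u', 'b', $text);

var_dump($byCodePoint, $byTwoBytes, $replaced);
```

The same \x{...} form should also work inside character classes, e.g. a range of code points, as long as /u is present.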
Darren

-- 
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/ (My flash charting demos)

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php