Hi Perl Unicode geeks,

I'm currently making our web application (MKDoc) support more than Western European languages. Being a French lad working in England who is currently learning Japanese (mainly because of anime movies, I admit it :)), I thought I had to do it the "Right Way", i.e. going all the way with Unicode.
One of the few problems I've been running into with Unicode is building human-readable URIs from Unicode strings. It's not much of a deal when constructing URIs from English titles, but it becomes a lot less obvious when "URLizing" from languages such as Punjabi or Gujarati.

To solve this I wrote an XS wrapper around IBM's ICU 2.0 libraries (attached), which I'm in the process of putting on CPAN. It neatly wraps the ICU transliteration services, which cover plenty of languages, character sets, etc.

Another cool thing about it is that it eases document indexing. You can transliterate documents first and then store a bunch of plain old ASCII keywords, which makes all databases very happy. It also makes it possible to search on transliterated ASCII strings, which is nice when you don't have a Punjabi keyboard to type search keywords with, for instance.

However, I'm having quite a lot of trouble with Japanese because of kanji (Chinese ideograms). ICU does provide Hiragana <=> Latin and Katakana <=> Latin, but doesn't do anything about kanji. That doesn't surprise me too much, given that in Japanese a kanji very often has more than one pronunciation depending on how and where it's used.

Another problem with Japanese is that, as far as I can tell, words are not separated by spaces. So even if the transliteration worked for kanji, I'd end up with very long unbroken strings, which is no good for indexing when you try to split text into keywords.

Any ideas? I'm quite worried that I have a webapp that works perfectly for Punjabi but kind of screws Japanese up when creating new documents and performing searches :-(

Cheers,
--
IT'S TIME FOR A DIFFERENT KIND OF WEB
================================================================
Jean-Michel Hiver - Software Director
[EMAIL PROTECTED]
+44 (0)114 221 4968
================================================================
VISIT HTTP://WWW.MKDOC.COM
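
P.S. To make the URI and indexing idea concrete, here is a rough sketch of the step that comes after transliteration. It is not the attached module's API: `to_ascii`, `ascii_keywords` and `uri_slug` are hypothetical names made up for illustration, and `to_ascii` is only a stand-in built on the core Unicode::Normalize module, so it handles accented Latin letters and nothing else. A real script-to-Latin transliterator would be passed in as the code reference instead.

```perl
use strict;
use warnings;
use utf8;                          # this source contains literal UTF-8
use Unicode::Normalize qw(NFKD);   # core module

# Hypothetical stand-in for a real transliterator: it only decomposes
# accented Latin letters and drops the combining marks.
sub to_ascii {
    my ($text) = @_;
    my $decomposed = NFKD($text);
    $decomposed =~ s/\p{Mn}+//g;   # strip combining accents
    return $decomposed;
}

# Split transliterated text into lowercase ASCII keywords.
# Anything that is not an ASCII letter or digit acts as a separator.
sub ascii_keywords {
    my ($text, $translit) = @_;
    my $ascii = lc $translit->($text);
    return grep { length } split /[^a-z0-9]+/, $ascii;
}

# Join the keywords with hyphens to get a human-readable URI segment.
sub uri_slug {
    my ($title, $translit) = @_;
    return join '-', ascii_keywords($title, $translit);
}

my $title = "Été à Sheffield: déjà vu";
print uri_slug($title, \&to_ascii), "\n";                    # ete-a-sheffield-deja-vu
print join(' ', ascii_keywords($title, \&to_ascii)), "\n";   # ete a sheffield deja vu
```

Note that any character the transliterator leaves untouched is treated as a separator, so a title made only of kanji comes out as an empty slug with no keywords at all. That is the Japanese problem described above in a nutshell.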
Attachment: Unicode-Transliterate-0.2.tgz (application/tar-gz)