On Friday, Aug 29, 2003, at 16:07 Asia/Tokyo, Nick Ing-Simmons wrote:
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
On Thu, Aug 28, 2003 at 03:16:20PM +0100, [EMAIL PROTECTED] wrote:

Does the existing perl5.8.* Unicode support have a way to efficiently
determine which script(s) or block (in unicode sense) a code point belongs
to?

use Unicode::UCD qw(charscript charblock);
print charscript(0x0388);
print charblock(0x30a0);

Great.

But that is not good enough for the cases below, because...


(Hiragana | Katakana | Han) => 'jisx0208.1990-0'

This is very wrong because jisx0208.1990-0 contains only those \p{Han} characters that appear in Japanese (in JIS X 0208, to be exact). On the other hand, jisx0208.1990-0 does contain the Greek and Cyrillic alphabets.
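
To make the mismatch concrete, here is a rough sketch that compares what charscript() reports with whether a code point can actually be encoded in JIS X 0208. It assumes that the "jis0208-raw" encoding shipped with Encode::JP is a fair stand-in for the jisx0208.1990-0 font charset; the helper name in_jisx0208 is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::JP;
use Unicode::UCD qw(charscript);

# Assumption: Encode::JP's raw JIS X 0208 table approximates the
# repertoire of the jisx0208.1990-0 font charset.
my $jis0208 = find_encoding('jis0208-raw')
    or die "jis0208-raw encoding is not available";

# Hypothetical helper: true if the code point has a mapping in JIS X 0208.
sub in_jisx0208 {
    my ($cp) = @_;
    my $char = chr($cp);    # work on a copy; encode() may modify its argument
    my $bytes = eval { $jis0208->encode($char, Encode::FB_CROAK()) };
    return defined $bytes ? 1 : 0;
}

# Two Han code points, one Greek, one Cyrillic.
for my $cp (0x4E00, 0x3400, 0x0391, 0x0410) {
    printf "U+%04X script=%s in_jisx0208=%d\n",
        $cp, charscript($cp), in_jisx0208($cp);
}
```

Both U+4E00 and U+3400 are reported as Han, but only the first has a JIS X 0208 mapping, while the Greek and Cyrillic letters do. So a font choice keyed on the script name alone goes wrong in both directions; checking encodability per character is one way around that.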


One of so many reasons why Han Unification was a bad idea. When it comes to Han Ideographs, Unicode's sense of charscript is almost useless.

小飼 弾


