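On Mon, 11 Sep 2000, rigel wrote:

> After much deliberation, I have finally decided to use radical +
> stroke-count ordering, for the following reasons:
>
> 1. The Han characters in Unicode are ordered by radical + stroke count,
>    so radical + stroke order is simply Unicode code point order.
>    The code points in the locale definition files are currently all
>    Unicode, [EMAIL PROTECTED]@ listed, so the file is
>    concise, [EMAIL PROTECTED]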

This is the simplest solution. However, does this mean that all the
characters in CJK Extension A (U+3400 ...) come first, followed by the
characters in CJK Unified Ideographs (U+4E00 ...), and finally by those
in CJK Extension B (U+20000 ...)[1], regardless of radical or stroke
count? And there may be a CJK Extension C to deal with, if and when it
comes out...

[1] Coming out very soon with Unicode 3.1 and ISO 10646-2:2001.

> 2. In the official Unicode files, about 7000 Han characters have no
>    pinyin. These are all rare characters or characters from Japan and
>    Korea; assigning pinyin to them ourselves would be an enormous
>    undertaking, close to impossible. This is the main reason I gave up
>    on pinyin ordering. (If you know of a more complete and
>    authoritative mapping table, please tell me.)

There are a lot of problems with assigning pronunciations (pinyin or
otherwise) to every character. Almost all of the Japanese, Korean, and
Vietnamese characters have no Chinese readings, and Chinese dialect
characters usually have no Pinyin readings. There are also many
characters that are listed in the large dictionaries even though no one
knows their readings or what they mean. E.g., how many people know that
the Pinyin reading for 囝 (子 inside 囗) is jian? (It means 'child' in
Min 閩語.)

~7000 missing readings is also nothing compared to the 40,000+ missing
readings for the characters in CJK Extension B! Who wants to fill them
in? :(

> ha> ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/ has a
> ha> Uni2Pinyin.gz file. But it is kinda old (1996).

There is a file in the Big5+ package from CMEX (http://www.cmex.org.tw/)
which gives readings (in Zhuyin Fuhao, but they can be converted) for
most of the 20,902 characters of Unicode 1.1. I recall they used one
dictionary for most of them (I think following Taiwan pronunciation
standards), then another for a few others, and left the dialect and
Japanese/Korean ones blank.

> ri> was based on Unicode 1.1. I'm also reluctant to accept any ad hoc
> ri> mapping tables, and prefer those from international or
> ri> national standard bodies, or credible research institutes.

I do not trust just any source, either.
Many, such as the UNIHAN.TXT file, do not document where their
information came from, or even whether they are a composite of multiple
sources!

> ha> Well, I would guess that most hanzi that had multiple pronunciations
> ha> are frequently used ones. Frequently used hanzi are sorted in GB2312
> ha> according to pinyin. Pinyin for hanzi with multiple pronunciations are
> ha> decided by the most frequently used pronunciation for the hanzi. I
> ha> find is solution is clean and simple. So a map to the GB2312 can
> ha> be used when there are ambiguity.

> ri> Another problem with the existence of multi-pronunciation is that
> ri> the programmers can not reliably depend on the collation based
> ri> sorting. Because one can not assume which pronunciation the user
> ri> intended to. Going with the most frequent pronunciation is not a
> ri> solution, because sometimes user might indeed look for a rarely
> ri> used pronunciation.

Actually, from what I've seen in the _Hanyu Da Zidian_ 漢語大字典, a lot
of characters have multiple readings, including infrequently used ones.
Many frequent characters also have additional readings that most people
don't know about. E.g., 她 is usually ta, but can also be chi (used in
girls' names) or jie (same as 姐).

> ha> The stroke-count order has long history of acceptance in China is
> ha> related to that there was no pronunciation standard and a symbolic
> ha> system to represent the pronunciation. (Okey, I don't know much about
> ha> it. Just my personal impression.) Even since we have a standard pinyin

Well, there have always been the pronunciation-based orders in rhymebooks
like the _Guangyun_ 廣韻, but very few people would be able to use such
a sorting order--maybe your literature professor. :) Plus, it doesn't
have every character out there...

Radical and stroke count has the advantage of only requiring you to see
the character. (There are sometimes disagreements over which radical a
character belongs under and how many strokes it has, but these are
relatively minor.)

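
To make the block-ordering concern above concrete, here is a minimal
Python sketch. It only shows that sorting by raw code point groups
characters by block first, so radical/stroke order restarts at each
block boundary. The block table, its labels, the helper name block_of,
and the sample characters are my own choices for illustration (ranges as
of Unicode 3.1); none of this comes from the locale files being
discussed.

```python
# Block ranges as of Unicode 3.1. The short labels are just for this
# sketch and are not official block names.
BLOCKS = [
    (0x3400, 0x4DB5, "Ext A"),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FA5, "URO"),      # CJK Unified Ideographs (20,902 chars)
    (0x20000, 0x2A6D6, "Ext B"),  # CJK Unified Ideographs Extension B
]


def block_of(ch):
    """Return the label of the block containing ch (hypothetical helper)."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


# Arbitrary sample: the first and last code points of the main block,
# its second code point, and the first code point of each extension.
samples = ["\u4E00", "\u3400", "\U00020000", "\u9FA5", "\u4E01"]

# Plain code point order, which is what the quoted proposal amounts to.
by_codepoint = sorted(samples, key=ord)
order = [(f"U+{ord(c):04X}", block_of(c)) for c in by_codepoint]

for code, block in order:
    print(code, block)
```

Every Extension A character sorts ahead of U+4E00, and every Extension B
character sorts after U+9FA5, whatever their radicals and stroke counts
are. A single radical/stroke sequence across all three blocks would need
a per-character key (radical, then residual strokes) taken from some
trusted table, rather than the code point itself.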
Thomas Chan [EMAIL PROTECTED]

