On Sat, 31 Mar 2001 [EMAIL PROTECTED] wrote:
> > From: Markus Kuhn <[EMAIL PROTECTED]>
> > - lots of historic ideographs have been added in Plane 02
>             ^^^^^^^^^^^^^^^^^^^
> In accurate, the ideographs in plane 2 is not necessarily be
> classified as historic. They are in fact the mixture of modern and
> historic ideographs.
(Note: I am not a subscriber to the [EMAIL PROTECTED] mailing list.)
Let us examine the composition of CJK Extension B, using the most
recent version (2001.3.27) of the Unihan database[1]. I have added some
dates and merged some sources together, where appropriate. Note that a
character may come from more than one source, so the numbers will not
add up to 42,711.
Source          Name (date)                         Count  Type
G_KX            Kangxi Zidian (1716)                18487  dictionary
G_HZ            Hanyu Da Zidian (1986)              10501  dictionary
G_CY            Ci Yuan (20th c.)                      66  dictionary
G_CH            Ci Hai (20th c.)                      247  dictionary
G_HC            Hanyu Da Cidian (20th c.)             553  dictionary
G_BK            Zhongguo Da Baike Quanshu (20th c.)    87  encyclopedia
G_FZ            Fangzheng Paiban Xitong (20th c.)      65  publishing system
G_4K            Siku Quanshu (18th c.)                522  collectanea
H               HKSCS (1999)                         1081  character set
T4/T5/T6/T7/TF  CNS 11643-1992                      30177  character set
J3/J4           JIS X 0213: 2000                      302  character set
K4              PKS 5700-3: 1998                      166  fake char set
V0              TCVN 5773: 1993                      1515  character set
V2/V3           VHN 01: 1998 and VHN 02: 1998        2717  fake (?) char set
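For anyone who wants to reproduce a tally along these lines, here is a
minimal sketch in Python. It assumes the Unihan file consists of
tab-separated lines of the form "U+XXXXX<TAB>field<TAB>value", with
"#" comment lines, and that the source fields are named like
kIRG_GSource, kIRG_TSource, and so on; check both assumptions against
the file you actually have. The function name count_sources is mine,
and it only counts per source letter (G, H, T, J, K, V). Splitting G
into G_KX, G_HZ, etc. would need the value column, whose exact layout
I am not asserting here.

```python
from collections import Counter

# CJK Extension B occupies U+20000..U+2A6D6, i.e. 42,711 code points.
EXT_B_FIRST = 0x20000
EXT_B_LAST = 0x2A6D6


def count_sources(lines):
    """Count Extension B characters per IRG source letter.

    `lines` is any iterable of text lines in the assumed
    'U+XXXXX<TAB>field<TAB>value' layout. A character that has
    several source fields is counted once under each of them.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not a data line in the assumed layout
        code, field, _value = parts
        if not (field.startswith("kIRG_") and field.endswith("Source")):
            continue  # only the IRG source fields are of interest
        code_point = int(code[2:], 16)  # strip the leading "U+"
        if EXT_B_FIRST <= code_point <= EXT_B_LAST:
            letter = field[len("kIRG_")]  # 'G' from 'kIRG_GSource'
            counts[letter] += 1
    return counts
```

With a local copy of the database, something like
count_sources(open("Unihan-3.1.txt", encoding="utf-8")) should give
the per-letter totals; because one character can carry several source
fields, the totals will sum to more than 42,711.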
Unicode Standard Annex #27[2] also states in section 10.1 that the G_KX
and G_HZ sources encode Han characters that are not already encoded in
the BMP. The G source data is hard to interpret, since a character may
only belong to one of the above G sources, even though it is pretty
obvious that a character can and will occur in more than one of the
above dictionaries/encyclopedias/collectanea. From the large numbers
for G_KX and G_HZ, it seems that in the Unihan database the choice has
been made to count such characters under one of those two sources, and
probably G_KX has precedence over G_HZ if a character appears in both.
However, if we disregard hardcopy sources, then the only G source that
matters for purposes of assessing the amount of legacy (electronic)
data is the G_FZ source, a publishing system. Unfortunately, the degree
to which that publishing system overlaps with other G sources cannot be
determined solely with data from the Unihan database. But does the
typical user have access to a publishing system like that?--of course
not.
The other sources, H, T, J, K, and V, are all character sets, or claim
to be, and serve as a better gauge of legacy (electronic) data. For
purposes of discussion, I will define a "typical user" as one using
(and generating data on) a mainstream platform (e.g., Windows).
The K4 source can be eliminated right away, as it is a fake--a South
Korean character set would have a "KS C" (old) or "KS X" prefix; K4 is
merely a fancy way to say "South Korean submissions". Korean text is
also strongly hangul-biased, and the typical user only has access to
characters in KS X 1001 (which are all in the BMP), if and when they
use them. If they're lucky, they might also have access to characters
in KS X 1002, but those are also in the BMP. In any case, there aren't
any South Korean character sets represented in CJK Extension B, so we
don't have to worry about atypical users.
For similar reasons, the V2 and V3 sources also look suspect, as they do
not have the "TCVN" prefix of Vietnamese standards. V0 is a real
character set, but it consists of historic chu+~ no^m characters. In
any case, Vietnamese is now written in Latin script, so any V source
may be regarded as historic.
The T sources are in contemporary use in Taiwan, but the de facto
industry character set is Big5 (all in the BMP), and not CNS 11643.
Therefore, the typical user will not have legacy (electronic) data,
although atypical users will, and 30,177 is a very large fraction of
CJK Extension B!
The J3 and J4 sources, to my knowledge, aren't available to the typical
Japanese user, who only has access to characters in JIS X 0208 (all in
the BMP), because of the sorry Shift-JIS encoding. Even if they are
fortunate enough to use EUC-JP and have access to characters in JIS X
0212, those are also in the BMP. Because of its newness (2000), I'm not
sure how many atypical users have access to the characters in JIS X
0213, but 302 is a rather small number.
Perhaps the greatest source of legacy (electronic) data is the H source,
since HKSCS is available to the typical Hong Kong user, who can (on
Windows platforms) and does (e.g., online newspaper sites expect one to
have it) install OS-level extension support to the base Big5 being
used. 1081 is not a small number of characters for a human, but it is
small in relation to the size of CJK Extension B.
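The claim that the mainstream legacy encodings cannot reach Plane 2 can
be spot-checked with the codecs that ship with Python. This is only a
sketch: the helper name encodable_in and the choice of test character
are mine, and Python's codec tables reflect later revisions of these
standards than the 2001-era ones discussed here, so treat the result as
an illustration rather than as evidence about any particular 2001
platform.

```python
# Codec names are Python's own; all of them ship with CPython.
LEGACY_CODECS = [
    "euc_kr",          # KS X 1001
    "big5",            # plain Big5
    "big5hkscs",       # Big5 with the HKSCS extension
    "shift_jis",       # JIS X 0208
    "euc_jp",          # JIS X 0208 plus JIS X 0212
    "shift_jisx0213",  # JIS X 0213
    "gb18030",         # GB 18030
]


def is_non_bmp(char):
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(char) > 0xFFFF


def encodable_in(char, codecs=LEGACY_CODECS):
    """Map each codec name to True/False: can it encode `char`?"""
    result = {}
    for name in codecs:
        try:
            char.encode(name)
            result[name] = True
        except UnicodeEncodeError:
            result[name] = False
    return result


if __name__ == "__main__":
    plane2_char = "\U0002000B"  # an arbitrary Extension B ideograph
    print(is_non_bmp(plane2_char))
    for name, ok in encodable_in(plane2_char).items():
        print(f"{name:15} {'yes' if ok else 'no'}")
```

The codecs limited to KS X 1001, plain Big5, JIS X 0208, and JIS X 0212
should all refuse a Plane 2 character, while gb18030 can represent any
Unicode code point; whether the HKSCS and JIS X 0213 codecs accept a
given character depends on whether that character is in their tables.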
In terms of legacy (electronic) data coming from typical users that
would map to Plane 2, it is basically just some ~1000 characters from
users of the HKSCS character set. However, if atypical users using more
"exotic" character sets such as the T sources (CNS 11643) or J sources
(JIS X 0213) are considered, then the number rises dramatically to over
30,000--about 70% of Plane 2's characters! (I'm not going to discuss
the issue of users using publishing systems, because: lack of data;
they are specialist users, like historians and academics who do use the
historic characters; and publishing systems are not for data
exchange--at least, not in conjunction with applications like xterm.)
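The proportions above are simple arithmetic on the counts from the
table, taking 42,711 as the size of CJK Extension B. A short sketch
(the names EXT_B_TOTAL and share are mine):

```python
EXT_B_TOTAL = 42711  # characters in CJK Extension B

# Counts taken from the table of sources earlier in this message.
COUNTS = {
    "H (HKSCS)": 1081,
    "T (CNS 11643-1992)": 30177,
    "J (JIS X 0213)": 302,
}


def share(count, total=EXT_B_TOTAL):
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


if __name__ == "__main__":
    for name, count in COUNTS.items():
        print(f"{name:20} {count:6d}  {share(count):5.1f}%")
```

Since a character may come from more than one source, the T and J
counts cannot simply be added: the T count alone is a lower bound on
the combined figure, and T plus J is an upper bound.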
What decision should one make? Well, if we can get those Hong Kong
users to move to Unicode and ditch legacy Big5-HKSCS encoding by
providing a functionally equivalent replacement, then that's always a
good thing. (I admit to having a personal interest here, since I would
like to be able to write in Cantonese.) Also, in the near future, JIS X
0213 will become available to the typical user in Japan, and likewise
GB 18030 in mainland China, which makes characters in Plane 2 fair game
to be used.
[1] http://www.unicode.org/Public/3.1-Update/Unihan-3.1.txt
[2] http://unicode.org/unicode/reports/tr27/
> > To be honest, I don't think that support for non-BMP characters in
> > terminal emulators is a particularly urgent issue, as the non-BMP
> > characters are unlikely to be of any real use to the vast majority of
> > terminal emulator users.
>
> I have to say this is unfortunately a false assumtion, of course
> it depends on the definition of the vast majority of terminal emulator
> users.
>
> One obvious example is JIS 0213 support.
Non-BMP characters are going to be important to someone somewhere, so
they should be supported eventually. However, it is only fair that
support for characters that have been encoded earlier (which happen to
be BMP characters, for now) should be implemented first--users who need
to use them have been waiting much longer, in some cases since Unicode
1.1 (1993), e.g., Indic scripts other than Devanagari and Tamil. (Of
course, there are exceptions--if something as crucial as the Euro was
encoded outside the BMP, there'd be support almost overnight.)
Thomas Chan
[EMAIL PROTECTED]
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/