I've been immersed in Big5-related issues for the past few days, and
have come back with these last-minute (err, week?) changes before 5.8-RC1.

The diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

(For jhi) The README fixes are trivial: they mention the new HanExtra
encodings, fix some mainland-China word usage, and add my Latin-1 name.

(For dan) big5-hkscs should be upgraded to the 2001 edition, as per
the Hong Kong government's decree. It's available separately at:

    http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz

Also, please delete big5.ucm and replace it with big5-eten, at:

    http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

I've fixed Alias.pm so that big5 aliases to big5-eten. The reason is
that 'Big5' as originally defined isn't used anywhere on earth;
non-Microsoft systems use 'big5' to mean 'big5-eten', and Microsoft
uses 'big5' to mean 'cp950'.
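
As a rough, standalone sketch of what the new alias patterns are meant to
accept (this is not part of the patch): the table layout and the resolve()
helper are hypothetical, and I'm assuming the "(en)" group is optional, as
the TW.pm alias table documents it.

```perl
use strict;
use warnings;

# Patterns modelled on the Alias.pm hunk in the diff below.
# The table and resolve() are hypothetical, for illustration only.
my @aliases = (
    [ qr/\bbig-?5$/i           => 'big5-eten'  ],
    [ qr/\bbig5-?et(?:en)?$/i  => 'big5-eten'  ],
    [ qr/\bbig5-?hk(?:scs)?$/i => 'big5-hkscs' ],
);

# Return the canonical name for the first matching pattern, or undef.
sub resolve {
    my ($name) = @_;
    for my $pair (@aliases) {
        my ($re, $canon) = @$pair;
        return $canon if $name =~ $re;
    }
    return undef;
}

for my $name (qw(Big5 big-5 big5-et big5eten big5-hk big5hkscs cp950)) {
    my $canon = resolve($name);
    printf "%-10s => %s\n", $name, defined $canon ? $canon : '(no alias)';
}
```

Note that cp950 deliberately falls through here: it is a canonical name in
its own right, not an alias of anything.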

It is therefore unwise to have a canonical 'big5' encoding, much like
there should not be a 'gb2312' encoding. Since gb2312 is now aliased
to euc-cn and not cp936, I think big5 should alias to big5-eten and
not cp950.
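
For anyone who wants to check the resulting behaviour, here is a minimal
sketch using Encode::resolve_alias(). It assumes an Encode with this patch
applied; in that case I'd expect big5 to resolve to big5-eten and gb2312
to euc-cn, while cp950 and cp936 stay as they are.

```perl
use strict;
use warnings;
use Encode ();

# Ask Encode which canonical encoding each ambiguous name resolves to.
# resolve_alias() returns the canonical name, or a false value if the
# name is unknown to this Encode installation.
for my $name (qw(big5 gb2312 cp950 cp936)) {
    my $canon = Encode::resolve_alias($name);
    printf "%-7s => %s\n", $name, $canon || '(unknown)';
}
```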

<!--
This matches T. H. Hsieh's similar decision for glibc-2.2:
<http://www.linux.org.tw/mail-archie/cle-devel/cle-devel.200009/msg00100.html>;
it also agrees with my FreeBSD charmap (and the dominant ETen charmap
in Taiwan). The Unicode mappings now also agree with libiconv-1.7's,
although the latter does not contain the ETen-specific parts.
-->

Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although
the encoding is called 'gb2312-raw'. I admit that I don't fully
understand the reason, but if that's to stand, then big5-eten could also
be named 'big5.ucm', and still say '<code_set_name> "big5-eten"', for
consistency's sake.

Thanks,
/Autrijus/

--- /home/autrijus/perl/ext/Encode/TW/TW.pm     Fri Apr 19 22:02:58 2002
+++ TW.pm       Sat Apr 20 03:13:07 2002
@@ -30,10 +30,10 @@
 
   Canonical   Alias            Description
   --------------------------------------------------------------------
-  big5        /\bbig-?5$/i     The original Big5 encoding
-  big5-hkscs  /\bbig5-hk(scs)?$/i
-                                Big5 plus Cantonese characters in 
-                                Hong Kong
+  big5-eten   /\bbig-?5$/i     Big5 encoding (with ETen extensions)
+             /\bbig5-?et(en)?$/i
+  big5-hkscs  /\bbig5-?hk(scs)?$/i
+                                Big5 + Cantonese characters in Hong Kong
   MacChineseSimp               Big5 + Apple Vendor Mappings
   cp950                                Code Page 950 
                                 = Big5 + Microsoft vendor mappings
@@ -44,11 +44,18 @@
 =head1 NOTES
 
 Due to size concerns, C<EUC-TW> (Extended Unix Character), C<CCCII>
-(Chinese Character Code for Information Interchange) and C<BIG5PLUS>
-(CMEX's Big5+) are distributed separately on CPAN, under the name
-L<Encode::HanExtra>. That module also contains extra China-based encodings.
+(Chinese Character Code for Information Interchange), C<BIG5PLUS>
+(CMEX's Big5+) and C<BIG5EXT> (CMEX's Big5e) are distributed separately
+on CPAN, under the name L<Encode::HanExtra>. That module also contains
+extra China-based encodings.
 
 =head1 BUGS
+
+Since the original C<big5> encoding (1984) is not supported anywhere
+(glibc and DOS-based systems use C<big5> to mean C<big5-eten>; Microsoft
+uses C<big5> to mean C<cp950>), a conscious decision was made to alias
+C<big5> to C<big5-eten>, which is the de facto superset of the original
+big5.
 
 The C<CNS11643> encoding files are not complete. For common C<CNS11643>
 manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
--- /home/autrijus/perl/ext/Encode/lib/Encode/Alias.pm  Wed Apr 10 05:13:28 2002
+++ Alias.pm    Sat Apr 20 03:11:11 2002
@@ -217,8 +217,9 @@
         define_alias( qr/(?:x-)?windows-949$/i    => '"cp949"' );
         define_alias( qr/\bks_c_5601-1987$/i      => '"cp949"' );
         # for Encode::TW
-       define_alias( qr/\bbig-?5$/i              => '"big5"' );
-       define_alias( qr/\bbig5-hk(?:scs)?$/i     => '"big5-hkscs"' );
+       define_alias( qr/\bbig-?5$/i              => '"big5-eten"' );
+       define_alias( qr/\bbig5-?et(?:en)?$/i     => '"big5-eten"' );
+       define_alias( qr/\bbig5-?hk(?:scs)?$/i    => '"big5-hkscs"' );
     }
     # utf8 is blessed :)
     define_alias( qr/^UTF-8$/i => '"utf8"',);
--- /home/autrijus/perl/README.tw       Thu Apr 18 06:01:01 2002
+++ README.tw   Sat Apr 20 03:15:51 2002
@@ -29,8 +29,8 @@
 
 Encode ┑家舱や穿タ砰いゅ絪絏よΑ:
 
-    big5       ﹍ Big5 絪絏 (ぱらゅ)
-    big5-hkscs Big5 + 翠栋
+    big5       Big5 絪絏 (ぱ┑)
+    big5-hkscs Big5 + 翠栋, 2001 
     cp950      絏 950 (Big5 + 稬硁睰才)
 
 羭ㄒㄓ弧, 盢 Big5 絪絏郎锣Θ Unicode, 惠龄:
@@ -61,8 +61,10 @@
 狦惠璶いゅ絪絏, 眖 CPAN (L<http://www.cpan.org/>) 更
 Encode::HanExtra 家舱. ウヘ玡矗ㄑ絪絏よΑ:
 
+    cccii      1980 ゅ穦いゅ戈癟ユ传絏
     euc-tw     Unix ┑才栋,  CNS11643 キ 1-7
     big5plus   いゅ计てм砃崩約膀穦 Big5+
+    big5ext    いゅ计てм砃崩約膀穦 Big5e
 
 , Encode::HanConvert 家舱玥矗ㄑ虏羉锣传ノㄢ贺絪絏:
 
@@ -163,6 +165,6 @@
 
 Jarkko Hietaniemi E<lt>[EMAIL PROTECTED]<gt>
 
-﹙簙 E<lt>[EMAIL PROTECTED]<gt>
+Autrijus Tang (﹙簙) E<lt>[EMAIL PROTECTED]<gt>
 
 =cut
--- /home/autrijus/perl/README.cn       Thu Apr 18 06:01:01 2002
+++ README.cn   Sat Apr 20 03:15:43 2002
@@ -24,7 +24,7 @@
 
 Perl 本身以 Unicode 进行操作. 这表示 Perl 内部的字符串数据可用 Unicode
 表示; Perl 的函式与算符 (例?缯?规表示式比对) 也能对 Unicode 进行操作.
-在输?爰笆涑鍪?, 为了处理以 Unicode 之前的编码方式储存的数据, Perl
+在输?爰笆涑鍪?, 为了处理以 Unicode 之前的编码方式存放的数据, Perl
 提供了 Encode 这个模块, 可以?媚闱嵋椎囟寥〖靶慈刖捎械谋嗦胧?据.
 
 Encode 延伸模块支援下列简体中文的编码方式:
@@ -36,7 +36,7 @@
     cp936      字码页 936, 也称为 GBK (扩充国标码)
     hz         7 比特逸出式 GB2312 编码
 
-举例来说, 将 EUC-CN 编码的档案转成 Unicode, 祗需键?胂铝兄噶?:
+举例来说, 将 EUC-CN 编码的文档转成 Unicode, 祗需键?胂铝兄噶?:
 
     perl -Mencoding=euc-cn,STDOUT,utf8 -pe1 < file.euc-cn > file.utf8
 
@@ -51,12 +51,12 @@
     # 启动 euc-cn 字串解析; 标准输出?爰氨曜即砦蠖忌栉? euc-cn 编码
     use encoding 'euc-cn', STDIN => 'euc-cn', STDOUT => 'euc-cn';
     print length("骆驼");           #  2 (双引号表示字符)
-    print length('骆驼');           #  4 (单引号表示位元组)
+    print length('骆驼');           #  4 (单引号表示字节)
     print index("谆谆教诲", "蛔唤"); # -1 (不包含此子字符串)
     print index('谆谆教诲', '蛔唤'); #  1 (从第二个字节开始)
 
-在最后一列例子里, "谆" 的第二个位元组与 "谆" 的第一个位元组结合成 EUC-CN
-码的 "蛔"; "谆" 的第二个位元组则与 "教" 的第一个位元组结合成 "唤".
+在最后一列例子里, "谆" 的第二个字节与 "谆" 的第一个字节结合成 EUC-CN
+码的 "蛔"; "谆" 的第二个字节则与 "教" 的第一个字节结合成 "唤".
 这解决了以前 EUC-CN 码比对处理上常见的问题.
 
 =head2 额外的中文编码
@@ -143,6 +143,6 @@
 
 Jarkko Hietaniemi E<lt>[EMAIL PROTECTED]<gt>
 
-唐宗汉 E<lt>[EMAIL PROTECTED]<gt>
+Autrijus Tang (唐宗汉) E<lt>[EMAIL PROTECTED]<gt>
 
 =cut