I've been immersed in Big5-related issues for the past few days, and have come back with these last-minute (err, week?) changes before 5.8-RC1.
The diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

(For jhi) The README fixes are trivial: they mention the new HanExtra encodings, fix some mainland-China word usage, and add my Latin-1 name.

(For dan) big5-hkscs should be upgraded to the 2001 edition, as per the Hong Kong government's decree. It's available separately at:

    http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz

Also, please delete big5.ucm and replace it with big5-eten, at:

    http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

I've fixed Alias.pm so that big5 aliases to big5-eten. The reason is that 'Big5' as originally defined isn't used anywhere on earth; non-Microsoft systems use 'big5' to mean 'big5-eten', and Microsoft uses 'big5' to mean 'cp950'. It is therefore unwise to have a canonical 'big5' encoding, much like there should not be a canonical 'gb2312' encoding. Since gb2312 is now aliased to euc-cn and not cp936, I think big5 should alias to big5-eten and not cp950.

<!-- This agrees with T. H. Hsieh's similar decision on glibc-2.2: <http://www.linux.org.tw/mail-archie/cle-devel/cle-devel.200009/msg00100.html>; it also agrees with my FreeBSD charmap (and the dominant ETen charmap in Taiwan). The Unicode mappings now also agree with libiconv-1.7's, although the latter does not contain the ETen-specific parts. -->

Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although the encoding is called 'gb2312-raw'. I admit that I don't fully understand the reason, but if that's to stand, then big5-eten could also be named 'big5.ucm' and still say '<code_set_name> "big5-eten"', for consistency's sake.
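As a rough sketch of what the aliasing described above amounts to: the snippet below is a standalone illustration, not the code in Alias.pm. The `@aliases` table and the `resolve` helper are made up for this example; the patterns mirror the define_alias lines in the patch, on the assumption that the "en" in "eten" is meant to be optional, as the TW.pm POD documents.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the alias table. The patterns follow the
# define_alias() calls in the patch; the optional "en" suffix is an
# assumption taken from the TW.pm POD.
my @aliases = (
    [ qr/\bbig-?5$/i           => 'big5-eten'  ],
    [ qr/\bbig5-?et(?:en)?$/i  => 'big5-eten'  ],
    [ qr/\bbig5-?hk(?:scs)?$/i => 'big5-hkscs' ],
);

# resolve() is an illustrative helper, not part of Encode.
# It returns the canonical name for the first matching pattern,
# or nothing when no pattern matches.
sub resolve {
    my ($name) = @_;
    for my $pair (@aliases) {
        my ($re, $canonical) = @$pair;
        return $canonical if $name =~ $re;
    }
    return;
}

for my $name (qw(Big5 big-5 big5-et Big5ETEN big5hk big5-hkscs cp950)) {
    my $canonical = resolve($name);
    printf "%-12s => %s\n", $name,
        defined $canonical ? $canonical : '(no alias)';
}
```

Note that 'cp950' falls through here on purpose: it is a canonical name in its own right, which is exactly why a bare 'big5' has to be pointed at one concrete encoding or the other.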
Thanks,
/Autrijus/

--- /home/autrijus/perl/ext/Encode/TW/TW.pm Fri Apr 19 22:02:58 2002
+++ TW.pm Sat Apr 20 03:13:07 2002
@@ -30,10 +30,10 @@
 
   Canonical   Alias                   Description
   --------------------------------------------------------------------
-  big5        /\bbig-?5$/i            The original Big5 encoding
-  big5-hkscs  /\bbig5-hk(scs)?$/i
-                              Big5 plus Cantonese characters in
-                              Hong Kong
+  big5-eten   /\bbig-?5$/i            Big5 encoding (with ETen extensions)
+              /\bbig5-?et(en)?$/i
+  big5-hkscs  /\bbig5-?hk(scs)?$/i
+                              Big5 + Cantonese characters in Hong Kong
   MacChineseSimp              Big5 + Apple Vendor Mappings
   cp950                       Code Page 950 = Big5 + Microsoft vendor mappings
 
@@ -44,11 +44,18 @@
 =head1 NOTES
 
 Due to size concerns, C<EUC-TW> (Extended Unix Character), C<CCCII>
-(Chinese Character Code for Information Interchange) and C<BIG5PLUS>
-(CMEX's Big5+) are distributed separately on CPAN, under the name
-L<Encode::HanExtra>. That module also contains extra China-based encodings.
+(Chinese Character Code for Information Interchange), C<BIG5PLUS>
+(CMEX's Big5+) and C<BIG5EXT> (CMEX's Big5e) are distributed separately
+on CPAN, under the name L<Encode::HanExtra>. That module also contains
+extra China-based encodings.
 
 =head1 BUGS
+
+Since the original C<big5> encoding (1984) is not supported anywhere
+(glibc and DOS-based systems use C<big5> to mean C<big5-eten>; Microsoft
+uses C<big5> to mean C<cp950>), a conscious decision was made to alias
+C<big5> to C<big5-eten>, which is the de facto superset of the original
+big5.
 
 The C<CNS11643> encoding files are not complete. For common C<CNS11643>
 manipulation, please use C<EUC-TW> in L<Encode::HanExtra>, which contains
--- /home/autrijus/perl/ext/Encode/lib/Encode/Alias.pm Wed Apr 10 05:13:28 2002
+++ Alias.pm Sat Apr 20 03:11:11 2002
@@ -217,8 +217,9 @@
     define_alias( qr/(?:x-)?windows-949$/i  => '"cp949"' );
     define_alias( qr/\bks_c_5601-1987$/i    => '"cp949"' );
     # for Encode::TW
-    define_alias( qr/\bbig-?5$/i            => '"big5"' );
-    define_alias( qr/\bbig5-hk(?:scs)?$/i   => '"big5-hkscs"' );
+    define_alias( qr/\bbig-?5$/i            => '"big5-eten"' );
+    define_alias( qr/\bbig5-?et(?:en)?$/i   => '"big5-eten"' );
+    define_alias( qr/\bbig5-?hk(?:scs)?$/i  => '"big5-hkscs"' );
     }
     # utf8 is blessed :)
     define_alias( qr/^UTF-8$/i => '"utf8"',);
--- /home/autrijus/perl/README.tw Thu Apr 18 06:01:01 2002
+++ README.tw Sat Apr 20 03:15:51 2002
@@ -29,8 +29,8 @@
 Encode ┑家舱や穿タ砰いゅ絪絏よΑ:
 
-    big5        ﹍ Big5 絪絏 (ぱらゅ)
-    big5-hkscs  Big5 + 翠栋
+    big5        Big5 絪絏 (ぱ┑)
+    big5-hkscs  Big5 + 翠栋, 2001
     cp950       絏 950 (Big5 + 稬硁睰才)
 
 羭ㄒㄓ弧, 盢 Big5 絪絏郎锣Θ Unicode, 惠龄:
 
@@ -61,8 +61,10 @@
 狦惠璶いゅ絪絏, 眖 CPAN (L<http://www.cpan.org/>) 更
 Encode::HanExtra 家舱. ウヘ玡矗ㄑ絪絏よΑ:
 
+    cccii       1980 ゅ穦いゅ戈癟ユ传絏
     euc-tw      Unix ┑才栋, CNS11643 キ 1-7
     big5plus    いゅ计てм砃崩約膀穦 Big5+
+    big5ext     いゅ计てм砃崩約膀穦 Big5e
 
 , Encode::HanConvert 家舱玥矗ㄑ虏羉锣传ノㄢ贺絪絏:
 
@@ -163,6 +165,6 @@
 
 Jarkko Hietaniemi E<lt>[EMAIL PROTECTED]<gt>
 
-﹙簙 E<lt>[EMAIL PROTECTED]<gt>
+Autrijus Tang (﹙簙) E<lt>[EMAIL PROTECTED]<gt>
 
 =cut
--- /home/autrijus/perl/README.cn Thu Apr 18 06:01:01 2002
+++ README.cn Sat Apr 20 03:15:43 2002
@@ -24,7 +24,7 @@
 
 Perl 本身以 Unicode 进行操作. 这表示 Perl 内部的字符串数据可用 Unicode
 表示; Perl 的函式与算符 (例?缯?规表示式比对) 也能对 Unicode 进行操作.
-在输?爰笆涑鍪?, 为了处理以 Unicode 之前的编码方式储存的数据, Perl
+在输?爰笆涑鍪?, 为了处理以 Unicode 之前的编码方式存放的数据, Perl
 提供了 Encode 这个模块, 可以?媚闱嵋椎囟寥〖靶慈刖捎械谋嗦胧?据.
 
 Encode 延伸模块支援下列简体中文的编码方式:
@@ -36,7 +36,7 @@
     cp936       字码页 936, 也称为 GBK (扩充国标码)
     hz          7 比特逸出式 GB2312 编码
 
-举例来说, 将 EUC-CN 编码的档案转成 Unicode, 祗需键?胂铝兄噶?:
+举例来说, 将 EUC-CN 编码的文档转成 Unicode, 祗需键?胂铝兄噶?:
 
     perl -Mencoding=euc-cn,STDOUT,utf8 -pe1 < file.euc-cn > file.utf8
 
@@ -51,12 +51,12 @@
     # 启动 euc-cn 字串解析; 标准输出?爰氨曜即砦蠖忌栉? euc-cn 编码
     use encoding 'euc-cn', STDIN => 'euc-cn', STDOUT => 'euc-cn';
     print length("骆驼");            # 2 (双引号表示字符)
-    print length('骆驼');            # 4 (单引号表示位元组)
+    print length('骆驼');            # 4 (单引号表示字节)
     print index("谆谆教诲", "蛔唤"); # -1 (不包含此子字符串)
     print index('谆谆教诲', '蛔唤'); # 1 (从第二个字节开始)
 
-在最后一列例子里, "谆" 的第二个位元组与 "谆" 的第一个位元组结合成 EUC-CN
-码的 "蛔"; "谆" 的第二个位元组则与 "教" 的第一个位元组结合成 "唤".
+在最后一列例子里, "谆" 的第二个字节与 "谆" 的第一个字节结合成 EUC-CN
+码的 "蛔"; "谆" 的第二个字节则与 "教" 的第一个字节结合成 "唤".
 这解决了以前 EUC-CN 码比对处理上常见的问题.
 
 =head2 额外的中文编码
@@ -143,6 +143,6 @@
 
 Jarkko Hietaniemi E<lt>[EMAIL PROTECTED]<gt>
 
-唐宗汉 E<lt>[EMAIL PROTECTED]<gt>
+Autrijus Tang (唐宗汉) E<lt>[EMAIL PROTECTED]<gt>
 
 =cut
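
For anyone who wants to try the character-versus-byte distinction that the README.cn example illustrates, here is a sketch that approximates it with explicit Encode calls. It is not what the README does: it swaps the `encoding` pragma for `Encode::encode`, and the variable names are mine.

```perl
use strict;
use warnings;
use utf8;                  # the string literals below are characters
use Encode qw(encode);

# The same text as a character string and as EUC-CN bytes.
my $chars = "骆驼";
my $bytes = encode('euc-cn', $chars);

# Haystack and needle from the README example, in both forms.
my $hay_chars    = "谆谆教诲";
my $needle_chars = "蛔唤";
my $hay_bytes    = encode('euc-cn', $hay_chars);
my $needle_bytes = encode('euc-cn', $needle_chars);

print "characters: ", length($chars), "\n";   # counts characters
print "bytes:      ", length($bytes), "\n";   # counts EUC-CN bytes

# Character-level search only matches at character boundaries;
# byte-level search can match across the boundary between two characters.
print "char index: ", index($hay_chars, $needle_chars), "\n";
print "byte index: ", index($hay_bytes, $needle_bytes), "\n";
```

The byte-level search matching across a character boundary is the false positive the README describes; searching decoded character strings avoids it.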