On 2002.01.10, at 15:18, Jarkko Hietaniemi wrote:
> Be certain to pick up the latest devel snapshot from:
>
> ftp://ftp.funet.fi/pub/languages/perl/snap/
>
> It's changed quite a bit since 5.7.2. Not that much in the Encode
> department, unfortunately (I think). Some of Sadahiro's patches
> went in since 5.7.2, that much I can see.

Bad news. It's gotten worse on the latest DEVEL14150. It completely ignores 2-byte chars. Here is the detailed research. I used MacOS 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150 (5.7.2 just didn't compile on FreeBSD; I think that's a known fact).

# first let's see if the conventional method works
perl -MJcode -ple '$_=jcode($_,"euc")->utf8' table.euc > table.utf8
# table.euc is a euc-jp encoded text that contains all ascii, JISX0201
# (aka Hankaku Kana) and JISX0208
iconv -f euc-jp -t utf8 table.euc > iconv.utf8
iconv -f utf8 -t euc-jp table.utf8 > iconv.euc

> diff -u table.euc iconv.euc
--- table.euc Wed Nov 15 14:46:44 2000
+++ iconv.euc Thu Jan 10 19:03:58 2002
@@ -8,7 +8,7 @@
 0xa0c0:
 0xa0e0:
 0xa1a0: 、。,.・:;?!゛゜´`¨^ ̄_ヽヾゝゞ〃仝々〆〇ー―‐
-0xa1c0: \〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
+0xa1c0: _〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
 0xa1e0: ÷=≠<>≦≧∞∴♂♀°′″℃¥$¢£%#&*@§☆★○●◎◇
 0xa2a0: ◆□■△▲▽▼※〒→←↑↓〓 ∈∋⊆⊇⊂
 0xa2c0: ∪∩ ∧∨¬⇒⇔∀∃ ∠⊥⌒

(Don't worry; Sadahiro-san can read it.)

This difference is acceptable. It arises because Jcode leaves the ASCII part [\x00-\x7e] untouched, while iconv faithfully follows the conversion table of the Unicode Consortium, which maps the "Zenkaku Backslash" (that is, the backslash that is assigned in JIS0208) back to the ASCII backslash. Virtually no Japanese user likes a 2-byte char being mapped back to ASCII, so I made Jcode leave ASCII alone. That behavior can be overridden by setting $Jcode::Unicode::Pedantic = 1. In short, both Jcode and iconv are acceptable for daily use.

Now comes the Encode module of 5.7.2.

# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc

Voila! diff -u table.utf8 camel572.utf8 gives me an empty string! They are completely identical. The bad news is that encoding back to euc produces trash. In other words, it works only halfway.

Now DEVEL14150.
Decode worked fine, as in 5.7.2, but when you try to encode from utf8 to euc-jp, perl croaks with:

euc-jp '[non-printable garbage]' does not map to UTF-8 at /home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm line 228

Now I am tempted to implement toplevel Encode myself....

Also, 5.7.2 and its variants appear pretty unstable. Let me see if Encode itself can work on 5.6.1 as well (it should, since it's under the ext/ directory after all; a little tweak to the compile script would be needed, however).

Dan the Man with Too Many Charsets to Handle
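
The round-trip comparison described above (euc-jp to utf8 and back, then a byte-for-byte comparison) can also be scripted. The following is only a minimal sketch, assuming an iconv that accepts the names EUC-JP and UTF-8 and a mktemp that supports -d. The names roundtrip_check and sample.euc are hypothetical and do not come from this mail, and the two-character sample merely stands in for the real table.euc.

```shell
#!/bin/sh
# roundtrip_check FILE (hypothetical helper, not part of Jcode or Encode):
# convert FILE from EUC-JP to UTF-8 and back with iconv, then compare
# the result byte for byte with the original file.
roundtrip_check() {
    src=$1
    tmp=$(mktemp -d) || return 2

    iconv -f EUC-JP -t UTF-8 "$src" > "$tmp/utf8" &&
    iconv -f UTF-8 -t EUC-JP "$tmp/utf8" > "$tmp/euc"
    status=$?

    if [ "$status" -eq 0 ] && cmp -s "$src" "$tmp/euc"; then
        echo "round trip OK"
        rc=0
    else
        echo "round trip differs"
        rc=1
    fi

    rm -rf "$tmp"
    return "$rc"
}

# Assumed sample data: Hiragana A and I (JIS X 0208) as EUC-JP bytes
# 0xA4 0xA2 0xA4 0xA4, written with POSIX octal escapes.
printf '\244\242\244\244\n' > sample.euc
roundtrip_check sample.euc
```

Pointing roundtrip_check at a full table such as table.euc would presumably flag the row containing the Zenkaku Backslash if the converter folds it into ASCII, which is the kind of difference the diff above shows.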