On 2002.01.10, at 15:18, Jarkko Hietaniemi wrote:
> Be certain to pick up the latest devel snapshot from:
>
> ftp://ftp.funet.fi/pub/languages/perl/snap/
>
> It's changed quite a bit since 5.7.2.  Not that much in the Encode
> department, unfortunately (I think).  Some of Sadahiro's patches
> went in since 5.7.2, that much I can see.

Bad news.  It's gotten worse on the latest DEVEL14150.  It completely
ignores 2-byte chars.  Here are the details.

I used MacOS 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150
(5.7.2 just didn't compile on FreeBSD; I think that's a known fact).

# First, let's see whether the conventional method works.
# table.euc is EUC-JP encoded text that contains all of ASCII, JISX0201
# (aka Hankaku Kana) and JISX0208.
perl -MJcode -ple '$_ = jcode($_, "euc")->utf8' table.euc > table.utf8
iconv -f euc-jp -t utf8 table.euc  > iconv.utf8
iconv -f utf8 -t euc-jp table.utf8 > iconv.euc

 > diff -u table.euc iconv.euc
--- table.euc   Wed Nov 15 14:46:44 2000
+++ iconv.euc   Thu Jan 10 19:03:58 2002
@@ -8,7 +8,7 @@
  0xa0c0:
  0xa0e0:
  0xa1a0:    、。,.・:;?!゛゜´`¨^ ̄_ヽヾゝゞ〃仝々〆〇ー―‐
-0xa1c0: \〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
+0xa1c0: _〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
  0xa1e0: ÷=≠<>≦≧∞∴♂♀°′″℃¥$¢£%#&*@§☆★○●◎◇
  0xa2a0:   ◆□■△▲▽▼※〒→←↑↓〓                      ∈∋⊆⊇⊂
  0xa2c0: ∪∩                ∧∨¬⇒⇔∀∃                      ∠⊥⌒

(Don't worry; Sadahiro-san can read it.)  This difference is
acceptable.  It comes from the fact that Jcode leaves the ASCII part
[\x00-\x7e] untouched, while iconv faithfully follows the conversion
table of the Unicode Consortium and maps "Zenkaku Backslash" (that is,
the backslash that is mapped in JISX0208) back to the ASCII backslash.
Virtually no Japanese users like a 2-byte char being mapped back to
ASCII, so I made Jcode leave ASCII alone.  That behavior can be
overridden by setting $Jcode::Unicode::Pedantic = 1.  In short, both
Jcode and iconv are acceptable for daily use.
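
For repeated runs, the round-trip comparison above could be wrapped in
a small shell function.  This is only a sketch under assumptions: the
name roundtrip_check is made up for illustration, it uses iconv in both
directions (rather than Jcode for the first leg), and it assumes an
iconv(1) that accepts the names EUC-JP and UTF-8, plus mktemp(1) and
cmp(1).

```shell
#!/bin/sh
# Hypothetical helper: convert an EUC-JP file to UTF-8 and back with
# iconv, then compare the result byte for byte with the original.
# Returns 0 if the round trip is lossless, non-zero otherwise
# (including when either conversion fails).
roundtrip_check() {
    src=$1                              # path to an EUC-JP encoded file
    tmp=$(mktemp -d) || return 2        # scratch directory for the two legs
    iconv -f EUC-JP -t UTF-8 "$src" > "$tmp/out.utf8" &&
    iconv -f UTF-8 -t EUC-JP "$tmp/out.utf8" > "$tmp/back.euc" &&
    cmp -s "$src" "$tmp/back.euc"
    rc=$?                               # status of the first step that failed, or of cmp
    rm -rf "$tmp"
    return $rc
}
```

It would be used as `roundtrip_check table.euc || echo "round trip
changed the data"`.  It only reports whether anything changed; to see
which rows differ, diff -u as above is still the tool.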

Now comes the Encode module of 5.7.2:

# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc

Voila!  diff -u table.utf8 camel572.utf8 gives me an empty string!  They
are completely identical.  The bad news is that encoding back to euc
produces trash.  So it works only halfway.

Now DEVEL14150.  Decoding worked fine, as in 5.7.2, but when you try to
encode from utf8 to euc-jp, perl croaks with:

euc-jp '[non-printable garbage]' does not map to UTF-8 at 
/home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm 
line 228

Now I am tempted to implement toplevel Encode myself....

Also, 5.7.2 and its variants appear pretty unstable.  Let me see if
Encode itself can work on 5.6.1 as well (it should; it's under the ext/
directory after all.  A little tweak to the compile scripts would be
needed, however).

Dan the Man with Too Many Charsets to Handle
