Dan Kogai <[EMAIL PROTECTED]> writes:
>Hi jhi,
>
>  My name is Dan Kogai. I am the author of Jcode.pm, which converts
>between various Japanese charsets. With the advent of the Encode module
>that comes with Perl 5.7.2 and up, I finally thought the role of Jcode
>was over and Jcode could rest in peace. When I tested the module,
>however, I found it was far from it. Rather, I believe I can help a
>great deal with the current implementation.

Excellent ! ;-)
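(For readers who have not met Jcode.pm: its functional interface is a
single call that converts a byte string between Japanese encodings. A
minimal sketch, using the same Jcode::convert call that appears in the
benchmark later in this mail; the slurp of table.euc is just for
illustration:)

    use Jcode;

    # slurp the EUC-JP test table used throughout this mail
    open my $fh, 'table.euc' or die "table.euc:$!";
    my $eucstr = do { local $/; <$fh> };

    # convert EUC-JP octets to UTF-8 octets; the third argument names
    # the source encoding (Jcode tries to guess it when omitted)
    my $utf8 = Jcode::convert($eucstr, 'utf8', 'euc');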
>Problem #1: Where is the rest of the charsets!?
>
>  When perl5.7.2 gets installed, it installs a bunch of .enc files under
>Encoding/, including good old euc-jp. But when you
>
>  perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'
>
>you get
>
>koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,
>iso-8859-1,cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,
>US-ascii,iso-8859-8,iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,
>iso-8859-15,cp1250,iso-8859-16,posix-bc
>
>  Those are 8-bit charsets only.

That was a deliberate decision on my part. Including "all" the ASCII-oid
8-bit encodings in their "compiled" form does not use much memory (as they
share at least half the space for the ASCII part). The compiled forms of
the multi-byte and two-byte encodings are larger. So I envisage
-MEncode=japanese (say) to load clusters.

>  I was at first disappointed, but I thought it over and found the
>Encode::Tcl module, which comes with no documentation. I read the code
>over and over and finally found
>
>  perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'
>
>  That gave me
>
>gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,
>iso-8859-13,iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,
>cp1250,posix-bc,cp1251,koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,
>iso-8859-8,macCyrillic,UCS-2,shiftjis,UTF-8,euc-jp,cp862,7bit-kana,
>cp861,cp860,macCroatian,jis0208,cp1254,cp37,iso-8859-9,7bit-jis,
>macGreek,big5,cp852,cp869,macCentEuro,iso-8859-1,cp1047,cp863,
>macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,cp1253,cp424,cp856,
>cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,iso2022-kr,
>cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,cp1258,
>jis0201,cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,
>iso-8859-15,cp865,macThai,HZ,macRomania,cp1257,gb12345,cp932

Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
version we used for a while before I invented the compiled form. The
Tcl-oid version is slow. The .enc files are lifted straight from Tcl;
it is unclear to me where the mappings come from. Modern Encode has C
code that processes a compiled form, and can compile ICU-like .ucm files
as well as .enc. The ICU form can represent fallbacks and non-reversible
stuff as well. At that point in the coding it became unclear whether we
could use ICU stuff - I think we have since concluded that we can.

>  And I smiled, and then wrote test code as follows.
>
>Problem #2: Does it really work?
>
>  So here is code #1, which encodes or decodes depending on the option.
>
>#!/usr/local/bin/perl5.7.2
>
>use strict;
>use Encode;
>use Encode::Tcl;
>
>my ($which, $from, $to) = @ARGV;
>my $op = $which =~ /e/ ? \&encode :
>         $which =~ /d/ ? \&decode : die "$0 [-[e|d]c] from to\n";
>my $check = $which =~ /c/;
>$check and warn "check set.\n";
>
>open my $in, "<$from" or die "$from:$!";
>open my $out, ">$to" or die "$to:$!";
>
>while (defined(my $line = <$in>)) {
>    use bytes;

File IO of encoded or UTF-8 data is very, very messy prior to perl5.7.
At best 'use bytes' is a hack.

>    # or print complains as follows:
>    # Wide character in print at ./classic.pl line 15, <$in> line 260.
>    print $out $op->('euc-jp', $line, $check);
>}
>__END__
>
>  It APPEARS to (en|de)code chars -- with lots of problems.
>  I fed it Jcode/t/table.euc, the file that contains all the characters
>defined in JIS X 0201 and JIS X 0208. Jcode tests itself by converting
>that file and back: if the (en|de)coder is OK, euc-jp -> utf8 -> euc-jp
>must give the original characters back.
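That round trip can be spelled directly against Encode's functional
interface. A minimal sketch (assuming table.euc sits in the current
directory; decode/encode are the same calls used in code #1 above):

    #!/usr/local/bin/perl5.7.2
    use strict;
    use Encode;
    use Encode::Tcl;    # needed for euc-jp, as noted above

    # slurp the EUC-JP test table as raw octets
    open my $in, '<', 'table.euc' or die "table.euc:$!";
    binmode $in;
    my $euc = do { local $/; <$in> };

    my $chars = decode('euc-jp', $euc);    # octets -> perl characters
    my $back  = encode('euc-jp', $chars);  # characters -> octets again

    print $euc eq $back ? "round trip OK\n" : "round trip FAILED\n";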
>  In the case of code #1 above, it did not. Many of the characters did
>appear converted, though. Emacs failed to auto-recognize the character
>encoding, but when I fed the resulting files to JEdit with the character
>set explicitly specified, the converted characters showed up.
>
>  Then I also tried this one.
>
>#!/usr/local/bin/perl5.7.2
>
>use strict;
>use Encode;
>use Encode::Tcl;
>
>my ($which, $from, $to) = @ARGV;
>my ($icode, $ocode);
>if ($which =~ /e/) {
>    $icode = "utf8"; $ocode = "encoding('euc-jp')";
>} elsif ($which =~ /d/) {
>    $icode = "encoding('euc-jp')"; $ocode = "utf8";
>} else {
>    die "$0 -[e|d] from to\n";
>}
>
>open my $in, "<:$icode", $from or die "$from:$!";
>open my $out, ">:$ocode", $to or die "$to:$!";
>
>while (defined(my $line = <$in>)) {
>    use bytes;

     ^^^^^^^^^
Catastrophic I would guess. 'use bytes' says "I know exactly what I am
doing", and so even though perl knows better it believes you and fails
to UTF-8-ify things etc.

>    print $out $line;
>}
>__END__
>
>  A new style. It does convert, but converts differently from the
>previous code. Also, this
>
>>Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
>>:Invalid argument.
>
>appears for some reason.
>  I can only say Encode is far from production level as far as the
>Japanese charsets are concerned.

I would agree. It would be good to have some test data in various
encodings. This is easy for 8-bit encodings: 0..255 is all you need.
But for 16-bit encodings (with gaps), and in particular multi-byte
encodings, you need a "sensible" starting sample.

>Problem #3: How about performance?
>
>  It's silly to talk about performance before the code runs right in
>the first place, but I could not help checking it out.
>  Encode::Tcl implements conversion by filling a lookup table on the
>fly. That's what Jcode::Unicode::NoXS does too (well, mine uses a
>lookup hash, though). How's the performance? I naturally benchmarked.
>
>#!/usr/local/bin/perl5.7.2
>
>use Benchmark;
>use Encode;
>use Encode::Tcl;
>use Jcode;
>
>my $count = $ARGV[0] || 1;
>
>my $eucstr;    # shared with the benchmarked subs below
>sub subread {
>    open my $fh, 'table.euc' or die "table.euc:$!";
>    read $fh, $eucstr, -s 'table.euc';    # -s gives the size in bytes
>    close $fh;
>}
>subread();
>
>timethese($count,
>    {
>        "Encode::Tcl" =>
>            sub { my $decoded = decode('euc-jp', $eucstr, 1) },
>        "Jcode" =>
>            sub { my $decoded = Jcode::convert($eucstr, 'utf8', 'euc') },
>    }
>);
>__END__
>
>And here is the result.
>
>Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.28 usr +  0.00 sys =  0.28 CPU) @ 3.57/s (n=1)
>             (warning: too few iterations for a reliable count)
>      Jcode:  0 wallclock secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50.00/s (n=1)
>             (warning: too few iterations for a reliable count)
>Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.32 usr +  0.00 sys =  0.32 CPU) @ 312.50/s (n=100)
>             (warning: too few iterations for a reliable count)
>      Jcode:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 3333.33/s (n=100)
>             (warning: too few iterations for a reliable count)
>Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.38 usr +  0.00 sys =  0.38 CPU) @ 2631.58/s (n=1000)
>             (warning: too few iterations for a reliable count)
>      Jcode:  1 wallclock secs ( 0.11 usr +  0.00 sys =  0.11 CPU) @ 9090.91/s (n=1000)
>             (warning: too few iterations for a reliable count)
>
>  Just as I guessed. The first invocation of Encode::Tcl is way slow
>because it has to fill the lookup table. It gets faster as time goes by.
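An aside on the second script above: the "Cannot find encoding
"'euc-jp'"" failure looks like it comes from the literal single quotes
inside $icode/$ocode being passed through as part of the encoding name.
With the layer spelled plainly, the script would read as follows (a
sketch only, not verified against 5.7.2's PerlIO):

    #!/usr/local/bin/perl5.7.2
    use strict;

    my ($from, $to) = @ARGV;

    # the encoding name goes into the layer directly, with no extra quotes
    open my $in,  "<:encoding(euc-jp)", $from or die "$from:$!";
    open my $out, ">:utf8",             $to   or die "$to:$!";

    # with the layers doing the conversion there is no need for
    # 'use bytes'; lines arrive as perl characters and leave re-encoded
    while (defined(my $line = <$in>)) {
        print $out $line;
    }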
>  The current implementation of Jcode (with XS) also suffers a
>performance problem on utf8, because it first converts the chars to
>UCS-2 and then to UTF-8.
>
>#4: Conclusion
>
>  I think I have grokked both fully enough to implement
>Encode::Japanese. I know you don't grok Japanese very well (which you
>don't have to; I don't grok Finnish either :). It takes more than a
>simple table lookup to handle Japanese well enough to make native
>grokkers happy. It has to detect automatically which of the many
>charsets is in use, it has to be robust, and most of all, it must be
>documented in Japanese :) I can do all that.
>  I believe Jcode must someday cease to exist as the Camel starts to
>grok Japanese. With the Encode module that day is sooner than I
>expected, and I want to help you make my day.
>  If I submit Encode::Japanese, are you going to merge it as a standard
>module?

I encourage you to look at Encode/encengine.c - it is a state machine
which reads tables to transform octet sequences. It is a lot faster than
the Encode::Tcl scheme. I _think_ Encode/compile (which builds the
tables) does the right thing for multi-byte and 16-bit encodings, but as
I have no reliable test data, viewer, or judgement of the end result, I
cannot be sure.

What I would like to see is:

A. A review of Encode's APIs and principles, to make sure I have not
   done anything really stupid. Both the API from a perl script's
   perspective and the API/mechanism that it expects an Encoding
   "plugin" to provide.

B. "Blessing" of the Xxxxx <-> Unicode mappings for the various
   encodings. Are Tcl's "good enough", or should we use ICU's, or
   Unicode's, or ... ?

C. Point me at "clusters" of related encodings that are often used
   together, and I can have a crack at building a "compiled" XS module
   that provides those encodings.

D. Some discussion of how to handle escape encodings and/or heuristics
   for guessing an encoding. I had some 1/4-thought-out ideas for how to
   get encengine.c to assist with these too - but I have probably
   forgotten them.

>
>Dan the Man with Too Many Charsets to Deal With

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/