Nick,

   Here you are at last.

> Excellent ! ;-)

   I have grokked Encode even further, and now I realize that I need to 
set some road map before I move forward....
   Here is a list of the things I am not sure about.

Portability: make Encode portable to pre-5.6 perls?
        
   That needs a complete rewrite of the current code;  Encode today is 
too CORE::-dependent (such as the use of utf8:: subs).  Still it is 
worth it, and with proper #ifdef's I think I can make even the XS part 
portable (see the sketch below).
   My opinion is to make Encode available both as part of the core and 
as an independent module, like so many popular ones -- libnet, DB_File, 
and Storable, to name a few.  Remember there are still lots of sites 
without 5.6, and for good reasons.
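
Here is a minimal sketch of the kind of guard I mean on the pure-perl 
side (the fallback branch is hypothetical and untested on old perls; 
utf8::upgrade exists only on perls recent enough to have the utf8:: 
utility subs):

# use utf8::upgrade where it exists, otherwise fall back to a no-op
BEGIN {
    if (defined &utf8::upgrade) {
        *my_upgrade = \&utf8::upgrade;   # modern perl: the real thing
    }
    else {
        *my_upgrade = sub { $_[0] };     # old perl: no utf8 flag at all
    }
}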

Conversion Tables: where should we store them?

   Encode today saves them as separate files.  Keeping data and code 
separate is the normal approach for a modern programmer, but module 
writers may disagree;  they may be happier if they can browse the actual 
data via 'perldoc -m'.
   This reminds me of the fact that Encode.pm contains multiple packages, 
such as Encode::Encodings.  I was lost at first when I tried 'perldoc 
Encode::Encodings' and it came up empty.

> That was a deliberate decision on my part. Including "all" the ASCII-oid
> 8-bit encodings in their "compiled" form does not use much memory
> (as they share at least 1/2 the space for the ASCII part).

   As a programmer I say that's fair.  As a native user of a non-roman 
script I say CJK is once again discriminated against.  It would be nice 
if Encode listed all currently available character sets without loading 
them -- or loaded ASCII and nothing else by default.

> The compiled forms of the multibyte and two-byte encodings are
> larger. So I envisage -MEncode=japanese (say) to load clusters.

   Once again it is programmatically correct and politically incorrect.  
IMHO Encode should load nothing but ASCII and utf8 by default, to be 
fair.
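
Here is a rough sketch of how such on-demand cluster loading might look 
(the %cluster map and the module names in it are purely hypothetical, 
not the actual Encode API):

package Encode;
# map import tags like 'japanese' to the modules carrying the tables
my %cluster = (
    japanese => [ 'Encode::JP' ],    # hypothetical module name
    chinese  => [ 'Encode::CN' ],    # ditto
);
sub import {
    my $class = shift;
    for my $tag (@_) {
        for my $mod (@{ $cluster{$tag} || [] }) {
            eval "require $mod" or die $@;   # load a cluster on demand
        }
    }
}

With that, -MEncode=japanese would pull in just the Japanese tables and 
nothing else.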


> Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
> version we used for a while before I invented the compiled form.
> The Tcl-oid version is slow.

   Yes it is, but it works.  Also, the compiled form is so far only 
available for 8-bit charsets.

> The .enc files are lifted straight from Tcl. It is unclear to me where
> the mappings come from.

   I believe they (I mean the Tclers) just converted the mapping files at 
ftp://ftp.unicode.org/Public/MAPPINGS/ to their taste.... Oh shoot!  I 
just checked the URI above and found EASTASIA is missing now!

> Modern Encode has C code that processes a compiled form and can compile
> ICU-like .ucm files as well as .enc. The ICU form can represent 
> fallbacks
> and non-reversible stuff as well.

.ucm is much easier on my eyeballs, though somewhat bulky.
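
For reference, a .ucm mapping entry looks roughly like this (going from 
memory of the ICU format, so treat it as a sketch; the |0 flag marks a 
round-trip mapping):

# U+3042 HIRAGANA LETTER A <-> euc-jp 0xA4 0xA2, round-trip
<U3042> \xA4\xA2 |0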

> At that point in the coding it became unclear whether we could use ICU
> stuff - I think we have since concluded that we can.

jhi answered that one, but I am not sure if we should make ICU the 
standard for perl encoding exchange....

> File IO of encoded or UTF-8 data is very very messy prior to perl5.7
> At best 'use bytes' is a hack.

I know.  To be honest with you, file IO semantics (and IO handles) are 
one of my least favorite parts of the beast (but I agree this is among 
the oldest guts of perl.  I started using perl because awk didn't let me 
open multiple files at once :).


>>
>> while(defined(my $line = <$in>)){
>>     use bytes;
>       ^^^^^^^^^^  Catastrophic I would guess.
> use bytes says "I know exactly what I am doing" and so even though
> perl knows better it believes you and fails to UTF-8-ify things
> etc.

Is there a straightforward interface that switches between byte 
semantics and utf8 at RUN TIME?
I just noticed that a script like the one above needs exactly that.

$toencode and eval {use bytes;};  # too hairy -- and since 'use bytes'
                                  # is lexical and compile-time, it
                                  # doesn't even work
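
For what it's worth, here is a sketch of flipping a single scalar at 
run time with the utf8:: utility subs instead (assuming a perl recent 
enough to have utf8::encode/decode; both modify the scalar in place):

if ($toencode) {
    utf8::encode($line);    # characters -> UTF-8 octets, in place
    # ... byte-oriented work on $line goes here ...
    utf8::decode($line);    # octets -> characters again
}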

>>   appears for some reasons.
>>   I can only say Encode is far from production level, so far as 
>> Japanese
>> charset is concerned.
>
> I would agree.
> It would be good to have some test data in various encodings.
> This is easy for 8-bit encodings 0..255 is all you need. But for
> 16-bit encodings (with gaps) and in particular multi-byte encodings
> you need a "sensible" starting sample.

   Yes.  As the writer of Jcode I know that only too well.  Japanese is 
not hard to learn to speak; Japanese encoding is.  There are AT LEAST 4 
encodings you have to deal with (euc-jp, shiftjis, iso-2022-jp, and 
Unicode).  Actually the Japanese encoding situation is tougher than in 
other East Asian languages because Japan started computing before the 
others did.  The others didn't have to make the same mistakes we did.  
Oh well....
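
For instance, converting among them is one Jcode call each (a sketch; 
$sjis_octets stands for whatever Shift_JIS text you happen to have):

use Jcode;
my $euc  = jcode($sjis_octets, 'sjis')->euc;           # -> euc-jp
my $jis  = jcode($sjis_octets, 'sjis')->iso_2022_jp;   # -> iso-2022-jp
my $utf8 = jcode($sjis_octets, 'sjis')->utf8;          # -> UTF-8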

> I encourage you to look at Encode/encengine.c - it is a state machine
> which reads tables to transform octet-sequences.

   I did.  Would you set your tabstop to 4 :)?

> It is a lot faster than Encode::Tcl scheme.
>
> I _think_ Encode/compile (which builds the tables) does right thing for
> multi-byte and 16-bit encodings but as I have no reliable test data,
> viewer or judgement of end result I cannot be sure.

   It does, but it still doesn't cut it for escape-based encodings like 
iso-2022, where the charset is switched in-band by escape sequences.
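
Here is a crude pure-perl sketch of the statefulness involved, covering 
only the two most common escape sequences of iso-2022-jp ($octets and 
handle() are hypothetical placeholders):

# walk an iso-2022-jp stream, tracking the current charset mode
my $mode = 'ascii';
for my $chunk (split /(\e\$B|\e\(B)/, $octets) {
    if    ($chunk eq "\e\$B") { $mode = 'jisx0208' }   # ESC $ B
    elsif ($chunk eq "\e(B")  { $mode = 'ascii'    }   # ESC ( B
    elsif (length $chunk)     { handle($mode, $chunk) }
}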

> What I would like to see is :
>
> A. A review of Encode's APIs and principles to make sure I have not
>    done anything really stupid. Both API from perl script's perspective
>    and also the API/mechanism that it expects an Encoding "plugin" to
>    provide.

   Yes.  Thanks to the API, encoders can be written very portably.  Here 
is an Encode::Jcode that I wrote in 3 minutes and that worked.

package Encode::Jcode;
use strict;
use Jcode;
use Encode qw(find_encoding);
use base 'Encode::Encoding';
use Carp;

# register the three Japanese encodings with Encode
sub add_encodings{
    for my $canon (qw(euc-jp iso-2022-jp shiftjis)){
        my $obj = bless { Name => $canon }, __PACKAGE__;
        $obj->Define($canon);
    }
}

sub import{
    add_encodings();
}

# map Encode's canonical names to Jcode's method names
my %canon2jcode = (
    'euc-jp'      => 'euc',
    'shiftjis'    => 'sjis',
    'iso-2022-jp' => 'iso_2022_jp',
);

sub encode{
    my ($self, $string, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($string, 'utf8')->$name;
}

sub decode{
    my ($self, $octet, $check) = @_;
    my $name = $canon2jcode{$self->{Name}};
    return jcode($octet, $name)->utf8;
}

1;

   The problem is that Encode itself is not portable enough to be an 
independent module....

> B. "Blessing" of the Xxxxx <-> Unicode mappings for various encodings.
>     Are Tcl's "good enough" or should we use ICU's or Unicode's or ... ?

   IMHO Tcl's are good enough TO START with.  But the implementation is 
another story.  Hmm....


> C. Point me at "clusters" of related encodings that are often used
>    together and I can have a crack and building "compiled" XS module
>    that provides those encodings.

   Another good question is how much to rely on XS.  Even Jcode comes 
with a NoXS module for those environments where you can't build XS, 
such as an ISP's server, MacOS, and Windows...
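
The usual fallback pattern is something like this generic sketch (the 
module names are hypothetical; this is not Jcode's actual code):

package My::Module;
# try the XS implementation first; fall back to pure perl when the
# XS part could not be built or loaded on this platform
our $HAVE_XS = eval { require XSLoader; XSLoader::load(__PACKAGE__); 1 }
    ? 1 : 0;
require My::Module::NoXS unless $HAVE_XS;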

> D. Some discussion as to how to handle escape encodings and/or
>    heuristics for guessing encoding. I had some 1/4 thought out
>    ideas for how to get encengine.c to assist on these too - but
>    I have probably forgotten them.

   Well, encoding guessing appears to be less needed for other languages 
than for Japanese.  Most others have just 'old' (pre-Unicode) and 'new' 
(Unicode).  China is a good example; they have virtually gb2312 and 
Unicode, and that's it.
   As for Japanese, just open Internet Explorer and look at the charset 
menu.  Only Japanese has 'Auto Detect'.
   Here is how Jcode 'Auto Detect's the character code, purely in perl 
(the %RE table of regexes, $DEBUG, and _max() are defined elsewhere in 
Jcode.pm):

sub getcode {
     my $thingy = shift;
     my $r_str = ref $thingy ? $thingy : \$thingy;

     my ($code, $nmatch, $sjis, $euc, $utf8) = ("", 0, 0, 0, 0);
     if ($$r_str =~ /$RE{BIN}/o) {       # 'binary'
         my $ucs2;
         $ucs2 += length($1)
             while $$r_str =~ /(\x00$RE{ASCII})+/go;
         if ($ucs2){      # smells like raw unicode
             ($code, $nmatch) = ('ucs2', $ucs2);
         }else{
             ($code, $nmatch) = ('binary', 0);
         }
     }
     elsif ($$r_str !~ /[\e\x80-\xff]/o) {       # not Japanese
         ($code, $nmatch) = ('ascii', 1);
     }                           # 'jis'
     elsif ($$r_str =~
            m[
              $RE{JIS_0208}|$RE{JIS_0212}|$RE{JIS_ASC}|$RE{JIS_KANA}
            ]ox)
     {
         ($code, $nmatch) = ('jis', 1);
     }
     else { # should be euc|sjis|utf8
         # use of (?:) by Hiroki Ohzaki <[EMAIL PROTECTED]>
         $sjis += length($1)
             while $$r_str =~ /((?:$RE{SJIS_C})+)/go;
         $euc  += length($1)
             while $$r_str =~ /((?:$RE{EUC_C}|$RE{EUC_KANA}|$RE{EUC_0212})+)/go;
         $utf8 += length($1)
             while $$r_str =~ /((?:$RE{UTF8})+)/go;
         $nmatch = _max($utf8, $sjis, $euc);
         carp ">DEBUG:sjis = $sjis, euc = $euc, utf8 = $utf8"
             if $DEBUG >= 3;
         $code =
             ($euc > $sjis and $euc > $utf8) ? 'euc' :
                 ($sjis > $euc and $sjis > $utf8) ? 'sjis' :
                     ($utf8 > $euc and $utf8 > $sjis) ? 'utf8' : undef;
     }
     return wantarray ? ($code, $nmatch) : $code;
}
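
For the record, usage looks like this ($octets being whatever Japanese 
text you have; getcode takes either a string or a reference):

use Jcode;
my ($code, $nmatch) = getcode(\$octets);
print "looks like $code\n";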

Well, I need to get some sleep now....

Dan the Man with Too Many Charsets To Deal With
