On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:
>> Now the problem is escape-based codings such as ISO-2022.
>
> Can you explain the way those work?
> I can imagine two ways for decode:
> A - keep going with current sub-encoding till we get a fail,
>     then look at next few octets for an escape sequence.
> B - Scan ahead for next escape sequence (or end of available input)
>     then translate up to that.
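[Editorial aside: method B is essentially a tokenizer -- find the next escape sequence, then hand everything before it to the converter for the current sub-encoding. A rough sketch, in Python rather than Perl for brevity, covering only the common ISO-2022-JP designators; the names are mine, not from any of the modules discussed below:]

```python
import re

# The common ISO-2022-JP designators: ESC $ @ / ESC $ B (JIS X 0208),
# ESC ( B (ASCII), ESC ( J (JIS X 0201 Roman), ESC ( I (JIS X 0201 kana).
ESC_RE = re.compile(rb'\x1b(?:\$[@B]|\([BJI])')

def segments(octets):
    """Yield (designator, chunk) pairs; the designator is None for
    any text before the first escape sequence."""
    pos, esc = 0, None
    for m in ESC_RE.finditer(octets):
        if m.start() > pos:
            yield esc, octets[pos:m.start()]
        esc, pos = m.group(), m.end()
    if pos < len(octets):
        yield esc, octets[pos:]
```

A decoder would then dispatch each chunk to the converter for the designated charset; scanning between escape boundaries like this is what lets jcode.pl (below) do the whole job with one substitution regex.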
To answer these questions, let's see what the existing utilities do.
Here I will discuss NKF, jcode.pl, and my humble Jcode.

NKF (Network Kanji Filter)
ftp://ftp.ie.u-ryukyu.ac.jp/pub/software/kono/
  * First appeared in 1987. Still maintained.
  * Handles EUC-JP, JIS (ISO-2022-JP) and SHIFT_JIS.
  * No Unicode support to date. This is understandable because the
    other "legacy" encodings need no conversion tables; they are all
    based upon JIS X 02xx.
  * Stream-based. No buffer allocation and such (this changed later
    when NKF.pm was added to the distribution, but even then NKF.xs
    just does buffer handling and nkf(1) does no in-memory conversion).
  * Uses method B for ISO-2022 (or my ungetc() !).

jcode.pl
ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
  * First appeared in 1992, BEFORE Perl 5.
  * Still maintained; still widely used, for the same reason
    cgi-lib.pl is used instead of CGI.pm.
  * Written 100% in perl.
  * No Unicode support.
  * Method C? Just like method B, but it uses a regex to grab the text
    between escape boundaries (see Jcode.pm below).

Jcode.pm
http://www.openlab.gr.jp/Jcode/
  * First appeared in 1999.
  * Unicode support added (XS and no-XS both supported).
  * Object model.
  * Internal routines based upon jcode.pl.
  * Here is the piece of Jcode that does JIS -> EUC conversion:

    sub jis_euc {
        my $thingy = shift;
        my $r_str  = ref $thingy ? $thingy : \$thingy;
        $$r_str =~ s(
            ($RE{JIS_0212}|$RE{JIS_0208}|$RE{JIS_ASC}|$RE{JIS_KANA})
            ([^\e]*)
        ){
            my ($esc, $str) = ($1, $2);
            if ($esc !~ /$RE{JIS_ASC}/o) {
                $str =~ tr/\x21-\x7e/\xa1-\xfe/;
                if ($esc =~ /$RE{JIS_KANA}/o) {
                    $str =~ s/([\xa1-\xdf])/\x8e$1/og;
                }
                elsif ($esc =~ /$RE{JIS_0212}/o) {
                    $str =~ s/([\xa1-\xfe][\xa1-\xfe])/\x8f$1/og;
                }
            }
            $str;
        }geox;
        $$r_str;
    }

> You could bundle several encodings in one XS (the way Encode itself
> bundles ASCII, iso-8859-* and koi8).

I know.
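[Editorial aside: once the regex has a segment in hand, the sub above does three things -- sets the high bit (the tr///), prefixes SS2 (0x8E) to each half-width kana byte, and prefixes SS3 (0x8F) to each JIS X 0212 pair. The same logic in Python; the function and charset names are mine, not Jcode's:]

```python
def jis_segment_to_euc(payload, charset):
    """Convert the bytes between two ISO-2022-JP escapes to EUC-JP.

    charset: 'ascii', 'kana' (JIS X 0201), '0208' or '0212'.
    """
    if charset == 'ascii':
        return payload
    # tr/\x21-\x7e/\xa1-\xfe/ -- set the high bit on each byte
    out = bytes(b + 0x80 if 0x21 <= b <= 0x7E else b for b in payload)
    if charset == 'kana':        # SS2 (0x8E) before each kana byte
        return b''.join(b'\x8e' + bytes([b]) for b in out)
    if charset == '0212':        # SS3 (0x8F) before each 2-byte char
        return b''.join(b'\x8f' + out[i:i+2] for i in range(0, len(out), 2))
    return out                   # JIS X 0208 needs the high bit only
```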
But it is bulky, and another problem is that Tcl has a different notion
of 'escape' (like euc_jp_0212, which is not exactly an escape but an
extension) which needs to be corrected for practical use.

> If any of the bundled encodings have similar sequences of code points
> then we will get overall table size reductions too.
>
> In the limit one could have Encode::CJK, but perhaps
> Encode::JP / Encode::CN / Encode::KR makes more sense ???

Right. From a user's point of view a distinct package namespace for
each (human) language is better. But again, this can be implemented
like:

  Encode::EUC  (does all EUC-based conversion)
  Encode::JP   (wrapper module that calls Encode::EUC and Encode::ISO2022)
  Encode::KR
  Encode::CN

and so forth. Actually even more table reduction can be done between
SHIFT_JIS and EUC. They are both based upon JIS X 0208 (and 0201 and
0212), so a simple calculation converts one to the other.

> We could ship things pre-compiled (with original .ucm's gzipped, or
> provide a way to extract a .ucm from the compiled form).
> Also the compile process is all in perl and has not really been tuned.
> It spends a lot of time trying to find common "strings" (which gets
> tables down in size so is a win.)

Right. Exactly how we do that still needs more experiments, but this is
what should be done....

Dan
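[Editorial aside on the "simple calculation" between SHIFT_JIS and EUC mentioned above: since both encode the same JIS X 0208 code points, a double-byte character converts with a few lines of arithmetic and no table at all. A sketch in Python -- this is the standard folding arithmetic, not Encode code:]

```python
def sjis_to_euc(s1, s2):
    """Convert one double-byte SHIFT_JIS character (JIS X 0208 only)
    to an EUC-JP byte pair, purely by arithmetic."""
    # Undo Shift_JIS row folding (two JIS rows per SJIS lead byte)
    # to recover the 7-bit JIS lead byte (0x21..0x7E).
    j1 = (s1 - (0x71 if s1 <= 0x9f else 0xb1)) * 2 + 1
    if s2 > 0x7f:        # skip the 0x7F gap in SJIS trail bytes
        s2 -= 1
    if s2 >= 0x9e:       # second JIS row of the folded pair
        j1 += 1
        j2 = s2 - 0x7d
    else:                # first JIS row of the folded pair
        j2 = s2 - 0x1f
    return j1 + 0x80, j2 + 0x80   # EUC-JP just sets the high bit
```

Running it on well-known code points: SJIS 0x88 0x9F maps to EUC 0xB0 0xA1, and SJIS 0x82 0xA0 maps to EUC 0xA4 0xA2.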