Re: Encode::XS for CJK

Nick Ing-Simmons Thu, 31 Jan 2002 08:15:53 -0800

Dan Kogai <[EMAIL PROTECTED]> writes:
>On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:
>>>   Now the problem is escape-based codings such as ISO-2022.
>>
>> Can you explain the way those work?
>> I can imagine two ways for decode:
>> A  - keep going with current sub-encoding till we get a fail,
>>      then look at next few octets for an escape sequence.
>> B. - Scan ahead for next escape sequence (or end of available input)
>>      then translate up to that.
>
>   To answer these questions, let's see what the existing utilities do.
>Here I will discuss NKF, jcode.pl and my humble Jcode.
>
>
>jcode.pl  ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
>
>* 1st appeared in 1992, BEFORE Perl5
>* still maintained; still widely used for the same reason cgi-lib.pl is
>   used instead of CGI.pm
>* Written 100% in perl
>* No Unicode support
>* Method C?  Just like method B, but it uses a regex to grab the text
>   between escape boundaries (see Jcode.pm below)


So would I if doing it in perl.
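
To make "method C" concrete, here is a minimal sketch of splitting a buffer at escape boundaries with a regex. It is written in Python purely for illustration; it is not taken from jcode.pl or Jcode.pm. The names ESCAPES, ESC_RE and split_on_escapes are hypothetical, and only the four escape sequences of ISO-2022-JP (RFC 1468) are assumed.

```python
import re

# Escape sequences defined for ISO-2022-JP (RFC 1468), mapped to an
# illustrative label for the character set each one selects.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jis0201-roman",
    b"\x1b$@": "jis0208-1978",
    b"\x1b$B": "jis0208-1983",
}

# One pattern that matches any of the four escape sequences above.
ESC_RE = re.compile(rb"\x1b(?:\([BJ]|\$[@B])")


def split_on_escapes(data):
    """Return a list of (charset, octets) runs found between escapes."""
    runs = []
    charset = "ascii"  # ISO-2022-JP text starts in ASCII
    pos = 0
    for m in ESC_RE.finditer(data):
        if m.start() > pos:
            # Everything before this escape belongs to the current charset.
            runs.append((charset, data[pos:m.start()]))
        charset = ESCAPES[m.group()]
        pos = m.end()
    if pos < len(data):
        runs.append((charset, data[pos:]))
    return runs
```

Each run can then be handed to a plain table-driven decoder for its sub-encoding, which is roughly what method B would do after scanning ahead.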

>
>> You could bundle several encodings in one XS (the way Encode itself
>> bundles ASCII, iso-8859-* and koi8).
>
>   I know.  But it is bulky

As a quick hack, I tried bundling:

                'euc-jp.ucm',
                'jis0201.enc',
                'jis0212.enc',
                'jis0208.enc',
                'shiftjis.enc',

The resulting XS's string table was only slightly larger than the one
for euc-jp.ucm on its own. (But time to compile was much longer.)
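
Why bundling costs so little can be sketched as follows. This is an illustrative Python sketch of the general idea (reuse an octet string that already occurs in the table), not the algorithm the Encode compile script actually uses; pack_strings is a hypothetical name.

```python
def pack_strings(strings):
    """Pack byte strings into one blob, reusing existing occurrences.

    Returns (blob, offsets) where offsets[s] is the position of s in blob.
    """
    blob = bytearray()
    offsets = {}
    # Longest first, so shorter strings can be found inside longer ones.
    # The secondary sort key only makes the output deterministic.
    for s in sorted(set(strings), key=lambda s: (-len(s), s)):
        pos = blob.find(s)
        if pos < 0:
            # Not present yet: append it to the end of the blob.
            pos = len(blob)
            blob += s
        offsets[s] = pos
    return bytes(blob), offsets
```

When a second encoding contributes mostly strings that are already present, the blob grows only by the few that are new, while the substring searches account for the extra compile time.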

>and another problem is that Tcl has a
>different notion of 'Escape' (like euc_jp_0212, which is not exactly an
>escape but an extension)

Tcl has both E (escape) encoding and X (eXtension) encoding as type fields.
I don't remember that from tcl/tk ...

>which needs to be corrected for practical
>use.
>
>> If any of the bundled encodings have similar sequences of code points
>> then we will get overall table size reductions too.
>>
>> In the limit one could have Encode::CJK, but perhaps
>> Encode::JP / Encode::CN / Encode::KR makes more sense ???
>
>   Right.  From a user's point of view distinct package space for each
>(human) language is better.  But again, this can be implemented like
>
>Encode::EUC (does all euc-based conversion)
>Encode::JP  (Wrapper module that calls Encode::EUC and Encode::ISO2022)
>Encode::KR
>Encode::CN
>
>   and so forth.
>   Actually even more table reduction can be done between SHIFT_JIS and
>EUC.  They are all based upon JISX0208 (and 0201 and 0212), so a simple
>calculation converts one to another.

Ah.
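
For reference, the "simple calculation" can be sketched like this. It is an illustrative Python sketch of the well-known EUC-JP to Shift_JIS arithmetic, not code from Encode or Jcode; euc_to_sjis is a hypothetical name. JIS X 0212 (the 0x8F prefix) is rejected because Shift_JIS has no room for it.

```python
def euc_to_sjis(data):
    """Convert EUC-JP octets to Shift_JIS octets by arithmetic alone."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        if c < 0x80:
            # ASCII passes through unchanged.
            out.append(c)
            i += 1
        elif c == 0x8E:
            # JIS X 0201 kana: drop the single-shift prefix.
            out.append(data[i + 1])
            i += 2
        elif c == 0x8F:
            raise ValueError("JIS X 0212 cannot be represented in Shift_JIS")
        else:
            # JIS X 0208: strip the high bits to get the 7-bit JIS pair.
            j1, j2 = c - 0x80, data[i + 1] - 0x80
            if j1 & 1:
                s2 = j2 + 0x1F
                if s2 >= 0x7F:
                    s2 += 1  # skip 0x7F, which is not a valid trail byte
            else:
                s2 = j2 + 0x7E
            s1 = (j1 - 0x21) // 2 + 0x81
            if s1 > 0x9F:
                s1 += 0x40  # jump over the single-byte kana range
            out += bytes((s1, s2))
            i += 2
    return bytes(out)
```

With this, only one JIS X 0208 table is needed and both EUC-JP and Shift_JIS can be derived from it.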

>
>> We could ship things pre-compiled (with original .ucm's gzipped, or
>> provide a way to extract a .ucm from the compiled form).
>> Also the compile process is all in perl and has not really been tuned.
>> It spends a lot of time trying to find common "strings" (which gets
>> tables down in size so is a win.)
>
>   Right.  How to do that still needs more experiments, but this is what
>should be done....
>
>Dan
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
