Re: Encode::XS for CJK

Nick Ing-Simmons Wed, 30 Jan 2002 23:57:33 -0800

Dan Kogai <[EMAIL PROTECTED]> writes:
>First, thank you all for perl@14503.
>
>On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
>> If I run the compile script on it and build Encode::EUC_JP
>> as an XS extension and change Encode::Tcl to ....
>
>   I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
>(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
>worked.  I have a feeling this will work for other CJK.
>   Now the problem is escape-based codings such as ISO-2022.


Can you explain the way those work?
I can imagine two ways to decode:
A. Keep going with the current sub-encoding till we get a fail,
   then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or the end of available
   input), then translate up to that.

A. is easy, but as all the escape sequences seem to be valid ASCII the
   ASCII decoder never fails on them, so it does not work.
B. requires an irritating double scan.
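
A rough sketch of option B for ISO-2022-JP, in Python purely for
illustration (Encode itself is Perl and C). The function names and the
trick of borrowing Python's euc_jp codec for the JIS X 0208 table are
assumptions of this sketch, not anything in Encode; the escape sequences
are the four listed in RFC 1468.

```python
import re

# RFC 1468 escape sequences and the sub-encoding each one selects.
ESCAPES = {
    b"\x1b(B": "ascii",     # ASCII
    b"\x1b(J": "jisx0201",  # JIS X 0201 Roman
    b"\x1b$@": "jisx0208",  # JIS X 0208-1978
    b"\x1b$B": "jisx0208",  # JIS X 0208-1983
}
ESC_RE = re.compile(rb"\x1b(?:\([BJ]|\$[@B])")


def decode_segment(chunk, mode):
    """Decode one escape-free run of octets in the given sub-encoding."""
    if mode == "jisx0208":
        # EUC-JP carries JIS X 0208 as the same two octets with the high
        # bit set, so reuse that codec rather than embed a table here.
        return bytes(b | 0x80 for b in chunk).decode("euc_jp")
    # Simplification: JIS X 0201 Roman is treated as plain ASCII.
    return chunk.decode("ascii")


def decode_iso2022jp(data):
    out, mode, pos = [], "ascii", 0
    while pos < len(data):
        m = ESC_RE.search(data, pos)          # scan 1: find the next escape
        end = m.start() if m else len(data)
        out.append(decode_segment(data[pos:end], mode))  # scan 2: translate
        if m is None:
            break
        mode = ESCAPES[m.group()]             # switch sub-encoding
        pos = m.end()
    return "".join(out)
```

The double scan is visible in the loop: the regex search touches every
octet of a run once, and the sub-decoder touches them again.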

For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with a C version).
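
For scale, a bit map covering every code point up to 0x10FFFF is
0x110000 bits, i.e. 136 KiB per sub-encoding. A minimal sketch of the
try-it-and-see alternative follows, again in Python for illustration
only. The helper names are hypothetical, only ASCII and JIS X 0208 are
handled, and Python's euc_jp codec stands in for the real table.

```python
def try_ascii(ch):
    """Return the octets for ch in ASCII, or None if not representable."""
    try:
        return ch.encode("ascii")
    except UnicodeEncodeError:
        return None


def try_jisx0208(ch):
    """Return the 7-bit JIS X 0208 octets for ch, or None."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Only two-octet forms with both octets >= 0xA1 are JIS X 0208;
    # 0x8E and 0x8F prefixes select other character sets in EUC-JP.
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


# (escape sequence, probe function) pairs; index 0 is the initial state.
SUBENCODINGS = [
    (b"\x1b(B", try_ascii),
    (b"\x1b$B", try_jisx0208),
]


def encode_iso2022jp(text):
    out = bytearray()
    current = 0
    for ch in text:
        # Probe the current sub-encoding first, then the rest in order.
        order = [current] + [i for i in range(len(SUBENCODINGS)) if i != current]
        for i in order:
            octets = SUBENCODINGS[i][1](ch)
            if octets is not None:
                if i != current:
                    out += SUBENCODINGS[i][0]  # emit escape on a switch
                    current = i
                out += octets
                break
        else:
            raise ValueError(f"no sub-encoding can represent {ch!r}")
    if current != 0:
        out += SUBENCODINGS[0][0]              # finish back in ASCII
    return bytes(out)
```

Probing the current sub-encoding first means a run of characters from
one set costs a single lookup each; the extra probes are only paid at
the boundaries between sets.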


>   Another small problem is that XS-based encoding consumes a whole
>directory immediately under perl/ext/Encode.  Well, I can live with a
>few dozens more.

You could bundle several encodings in one XS (the way Encode itself
bundles ASCII, iso-8859-* and koi8).
If any of the bundled encodings have similar sequences of code points
then we will get overall table size reductions too.

In the limit one could have Encode::CJK, but perhaps
Encode::JP / Encode::CN / Encode::KR makes more sense ???

>   And the speed of the compile script may be a problem if we want all
>CJK to be XS-based.  It roughly takes about 25 seconds to compile single
>CJK encoding on my FreeBSD box.  Well, I can live with that too but
>other porters may find it frustrating....

We could ship things pre-compiled (with the original .ucm's gzipped, or
provide a way to extract a .ucm from the compiled form).
Also the compile process is all in Perl and has not really been tuned.
It spends a lot of time trying to find common "strings" (which gets tables
down in size so is a win.)
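
To give a feel for the idea (not the compile script's actual algorithm),
here is one simple way to share common strings between tables, sketched
in Python. The StringPool class is hypothetical: every octet string goes
into one pooled blob, and a string that already occurs somewhere in the
blob is referenced by offset instead of being stored again.

```python
class StringPool:
    """Append-only blob that reuses any existing occurrence of a string."""

    def __init__(self):
        self.blob = bytearray()

    def add(self, s):
        """Return the offset of s in the blob, appending only if absent."""
        off = self.blob.find(s)   # linear search over everything stored so far
        if off < 0:
            off = len(self.blob)
            self.blob += s
        return off


pool = StringPool()
first = pool.add(b"\xa4\xa2\xa4\xa4\xa4\xa6")  # stored at offset 0
second = pool.add(b"\xa4\xa4")                 # found inside the first string
```

Here the second string costs no extra space. The search on every add is
where the time goes in a naive version like this one.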

>   I think we are making a significant progress in CJK....
>
>Dan
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
