Re: Encode::XS for CJK

SADAHIRO Tomoyuki Thu, 31 Jan 2002 06:57:13 -0800


Nick Ing-Simmons <[EMAIL PROTECTED]> wrote:


> Dan Kogai <[EMAIL PROTECTED]> writes:
> >First, thank you all for perl@14503.
> >
> >On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
> >> If I run the compile script on it and build Encode::EUC_JP
> >> as an XS extension and change Encode::Tcl to ....
> >
> >   I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
> >(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
> >worked.  I have a feeling this will work for other CJK.
> >   Now the problem is escape-based codings such as ISO-2022.
> 
> Can you explain the way those work?
> I can imagine two ways for decode:
> A  - keep going with current sub-encoding till we get a fail,
>      then look at next few octets for an escape sequence.
> B. - Scan ahead for next escape sequence (or end of available input)
>      then translate up to that.
> 
> A. Is easy - but as all escape sequences seem to be valid ASCII does not
>    work.
> B. requires an irritating double scan.

Encode::Tcl does neither A nor B (though B would be better).

If the next byte is an "escape",
   it invokes a new sub-encoding (CCS);
else
   it decodes the byte and converts it to Unicode.

The escape octets for Encode::Tcl::Escape are ESC, SI, and SO;
those for Encode::Tcl::Extend are SS2 ("\x8E") and SS3 ("\x8F");
that for Encode::Tcl::HanZi is '~', the tilde.

So any octet sequence up to the next "escape" octet
could be handed to the translator as one run, as in B.
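
For illustration only, here is a minimal sketch of that dispatch loop, written
in Python rather than Perl or XS. It is not the Encode::Tcl code. It assumes
just two ISO-2022-JP designations (ESC ( B for ASCII, ESC $ B for JIS X 0208),
and the names DESIGNATIONS, decode_run and decode_escaped are made up for the
example. The JIS X 0208 run is decoded by setting the high bit of each octet
and reusing Python's euc_jp codec.

```python
ESC = 0x1B

# Hypothetical table for this sketch: escape sequence -> sub-encoding (CCS).
# Only two ISO-2022-JP designations are covered.
DESIGNATIONS = {
    b"\x1b(B": "ascii",     # ESC ( B designates ASCII
    b"\x1b$B": "jisx0208",  # ESC $ B designates JIS X 0208-1983
}


def decode_run(run, ccs):
    """Decode one escape-free run of octets in the given sub-encoding."""
    if ccs == "ascii":
        return run.decode("ascii")
    # JIS X 0208 is a 7-bit double-byte set; setting the high bit of each
    # octet yields the EUC-JP form, which Python can decode directly.
    return bytes(b | 0x80 for b in run).decode("euc_jp")


def decode_escaped(data):
    """If the next octet is an escape, switch CCS; otherwise decode a run."""
    out = []
    ccs = "ascii"  # assumed initial state
    pos = 0
    while pos < len(data):
        if data[pos] == ESC:
            seq = data[pos:pos + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence: %r" % seq)
            ccs = DESIGNATIONS[seq]
            pos += 3
        else:
            # Take everything up to the next escape (or the end of input).
            end = data.find(b"\x1b", pos)
            if end == -1:
                end = len(data)
            out.append(decode_run(data[pos:end], ccs))
            pos = end
    return "".join(out)
```

A real decoder would also have to cope with input that ends in the middle of
an escape sequence or of a double-byte character; the sketch simply raises.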

> For encode there is a different pain. For each code point we need an
> efficient way to find out whether a sub-encoding can represent that
> point. A bit map of 0x10FFFF entries does not seem good, so it is
> either an auxiliary table, or try-it-and-see (which should not be too bad
> with C version).

Encode::Tcl takes the try-it-and-see approach.
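
As a rough sketch of what try-it-and-see can look like on the encode side
(again in Python for illustration, not the Encode::Tcl code): for each
character, try the current sub-encoding first, then the others, and emit an
escape sequence only when the sub-encoding changes. The names CANDIDATES,
try_encode and encode_escaped are hypothetical, and only ASCII and JIS X 0208
are assumed as sub-encodings.

```python
# Hypothetical candidate list for this sketch: (CCS name, designating escape).
CANDIDATES = [
    ("ascii", b"\x1b(B"),
    ("jisx0208", b"\x1b$B"),
]


def try_encode(ch, ccs):
    """Return the octets for ch in the given CCS, or None if it has none."""
    try:
        if ccs == "ascii":
            return ch.encode("ascii")
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # In EUC-JP, JIS X 0208 characters are exactly two octets, both >= 0xA1.
    # Clearing the high bits gives the 7-bit form used after ESC $ B.
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


def encode_escaped(text):
    """Encode text, switching sub-encodings by trial and error."""
    out = bytearray()
    current = "ascii"  # assumed initial state
    for ch in text:
        # Try the current sub-encoding first so no escape is emitted needlessly.
        order = sorted(CANDIDATES, key=lambda c: c[0] != current)
        for ccs, esc in order:
            octets = try_encode(ch, ccs)
            if octets is not None:
                if ccs != current:
                    out += esc
                    current = ccs
                out += octets
                break
        else:
            raise ValueError("no sub-encoding can represent %r" % ch)
    if current != "ascii":
        out += b"\x1b(B"  # return to ASCII at the end of the text
    return bytes(out)
```

The cost is one failed attempt per change of sub-encoding, which is why an
auxiliary table only pays off when the list of candidates is long.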

> --
> Nick Ing-Simmons
> http://www.ni-s.u-net.com/

SADAHIRO Tomoyuki
