Nick Ing-Simmons <[EMAIL PROTECTED]> wrote:
> Dan Kogai <[EMAIL PROTECTED]> writes:
> >First, thank you all for perl@14503.
> >
> >On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
> >> If I run the compile script on it and build Encode::EUC_JP
> >> as an XS extension and change Encode::Tcl to ....
> >
> > I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
> >(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
> >worked.  I have a feeling this will work for other CJK.
> > Now the problem is escape-based codings such as ISO-2022.
>
> Can you explain the way those work?
> I can imagine two ways for decode:
> A. - Keep going with current sub-encoding till we get a fail,
>      then look at next few octets for an escape sequence.
> B. - Scan ahead for next escape sequence (or end of available input)
>      then translate up to that.
>
> A. is easy - but as all escape sequences seem to be valid ASCII, it does
> not work.
> B. requires an irritating double scan.

Encode::Tcl is neither A nor B (though B would be better).
If the next octet is an "escape" octet, it invokes a new sub-encoding (CCS);
otherwise it decodes the octets and converts them to Unicode.

The escape octets for Encode::Tcl::Escape are ESC, SI, and SO;
those for Encode::Tcl::Extend are SS2 ("\x8E") and SS3 ("\x8F");
the one for Encode::Tcl::HanZi is '~', the tilde.

Any octet sequence up to the next "escape" octet can then be passed
to the current sub-encoding for translation.

> For encode there is a different pain. For each code point we need an
> efficient way to find out whether a sub-encoding can represent that
> point. A bit map of 0x10FFFF entries does not seem good, so it is
> either an auxiliary table, or try-it-and-see (which should not be too bad
> with C version).

Encode::Tcl tries and sees.

> --
> Nick Ing-Simmons
> http://www.ni-s.u-net.com/

SADAHIRO Tomoyuki
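
For illustration only, the decode loop and the try-it-and-see encode
described above might be sketched roughly as follows. This is not the
Encode::Tcl code. It is written in Python rather than Perl, it handles
only two ISO-2022-JP designations (ESC ( B for ASCII and ESC $ B for
JIS X 0208), and the names ESCAPES, decode_run, try_encode, decode and
encode are hypothetical. It also borrows Python's own "euc_jp" codec
as the JIS X 0208 table by setting or clearing the high bit.

```python
ESC = 0x1B

# Hypothetical table: escape sequence -> sub-encoding (CCS) it selects.
# Only two designations are listed to keep the sketch short.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b$B": "jisx0208",
}


def decode_run(ccs, run):
    """Translate one escape-free run of octets with the current CCS."""
    if ccs == "ascii":
        return run.decode("ascii")
    # Assumption: 7-bit JIS X 0208 pairs become EUC-JP once the high bit
    # is set on every octet, so the stock euc_jp codec can do the lookup.
    return bytes(b | 0x80 for b in run).decode("euc_jp")


def decode(data):
    """If the next octet is ESC, switch CCS; else translate up to the next ESC."""
    out = []
    ccs = "ascii"
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for seq, name in ESCAPES.items():
                if data.startswith(seq, i):
                    ccs = name
                    i += len(seq)
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
        else:
            j = data.find(b"\x1b", i)
            if j == -1:
                j = len(data)
            out.append(decode_run(ccs, data[i:j]))
            i = j
    return "".join(out)


def try_encode(ccs, ch):
    """Return the octets for ch in ccs, or None if ccs cannot represent it."""
    try:
        if ccs == "ascii":
            return ch.encode("ascii")
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Accept only the two-octet JIS X 0208 form (both octets >= 0xA1).
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


def encode(text):
    """Try the current CCS first; on failure, try the others and emit an escape."""
    out = bytearray()
    current = "ascii"
    for ch in text:
        octets = try_encode(current, ch)
        if octets is None:
            for seq, name in ESCAPES.items():
                octets = try_encode(name, ch)
                if octets is not None:
                    out += seq
                    current = name
                    break
            else:
                raise ValueError(f"no sub-encoding can represent {ch!r}")
        out += octets
    if current != "ascii":
        out += b"\x1b(B"  # return to ASCII at end of output
    return bytes(out)
```

The decode side never looks ahead past the next escape octet, so each
run is scanned once to find its end and once to translate it. The encode
side pays for a failed attempt only when the character falls outside the
current sub-encoding.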