Dan Kogai <[EMAIL PROTECTED]> writes:
>On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:
>>> Now the problem is escape-based codings such as ISO-2022.
>>
>> Can you explain the way those work?
>> I can imagine two ways for decode:
>> A - keep going with current sub-encoding till we get a fail,
>>     then look at next few octets for an escape sequence.
>> B - Scan ahead for next escape sequence (or end of available input)
>>     then translate up to that.
>
>  To answer these questions, let's see what the existing utilities do.
>Here I will discuss NKF, jcode.pl and my humble Jcode.
>
>
>jcode.pl ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
>
>* 1st appeared in 1992, BEFORE Perl5
>* still maintained; still widely used for the same reason cgi-lib.pl is
>  used instead of CGI.pm
>* Written 100% in perl
>* No Unicode support
>* Method C?  Just like method B but it uses regex to grab between escape
>  boundaries (see Jcode.pm) below
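
For readers who want to see method B (and its regex-based variant) concretely, here is a minimal sketch in Python. It is not how jcode.pl, Jcode.pm or Encode do it. The names `ESCAPES`, `ESC_RE` and `split_runs` are made up for illustration, and only the four escape sequences of ISO-2022-JP (RFC 1468) are assumed. The idea is simply to find each escape sequence, then hand everything up to the next one to the sub-encoding that was selected.

```python
import re

# Escape sequences of ISO-2022-JP (RFC 1468). The charset labels on the
# right are illustrative names, not identifiers from any real library.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jisx0201-roman",
    b"\x1b$@": "jisx0208-1978",
    b"\x1b$B": "jisx0208-1983",
}

# One regex that matches any of the escape sequences above.
ESC_RE = re.compile(rb"\x1b(?:\([BJ]|\$[@B])")


def split_runs(data):
    """Split an ISO-2022-JP byte string into (charset, raw_bytes) runs.

    Scans ahead to the next escape sequence (or end of input) and emits
    everything before it under the charset currently in effect.
    """
    runs = []
    charset = "ascii"  # ISO-2022-JP text starts in ASCII
    pos = 0
    for m in ESC_RE.finditer(data):
        if m.start() > pos:
            runs.append((charset, data[pos:m.start()]))
        charset = ESCAPES[m.group(0)]  # switch sub-encoding
        pos = m.end()
    if pos < len(data):
        runs.append((charset, data[pos:]))
    return runs
```

Each run can then be translated with a plain table lookup for its charset. A real decoder working on a stream would also have to cope with an escape sequence cut in half at the end of the available input, which this sketch ignores.
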

So would I if doing it in perl.

>
>> You could bundle several encodings in one XS (the way Encode itself
>> bundles ASCII, iso-8859-* and koi8).
>
>  I know.  But it is bulky

As a quick hack I tried bundling:
     'euc-jp.ucm',
     'jis0201.enc',
     'jis0212.enc',
     'jis0208.enc',
     'shiftjis.enc',

The resulting XS's string table was only slightly larger than the
one for euc-jp.ucm on its own. (But time to compile was much longer.)

>and another problem is that Tcl has a
>different notion of 'Escape' (like euc_jp_0212, which is not exactly an
>escape but an extension)  Tcl has both E (escape) encoding and X (eXtension) encoding as type fields.

I don't remember that from tcl/tk ...

>which needs to be corrected for practical
>use.
>
>> If any of the bundled encodings have similar sequences of code points
>> then we will get overall table size reductions too.
>>
>> In the limit one could have Encode::CJK, but perhaps
>> Encode::JP / Encode::CN / Encode::KR makes more sense ???
>
>  Right.  From a user's point of view distinct package space for each
>(human) language is better.  But again, this can be implemented like
>
>Encode::EUC (does all euc-based conversion)
>Encode::JP  (Wrapper module that calls Encode::EUC and Encode::ISO2022)
>Encode::KR
>Encode::CN
>
>  and so forth.
>  Actually even more table reduction can be done between SHIFT_JIS and
>EUC.  They are all based upon JISX0208 (and 0201 and 0212) so a simple
>calculation
>converts one to another.

Ah.

>
>> We could ship things pre-compiled (with original .ucm's gzipped, or
>> provide a way to extract a .ucm from the compiled form).
>> Also the compile process is all in perl and has not really been tuned.
>> It spends a lot of time trying to find common "strings" (which gets
>> tables
>> down in size so is a win.)
>
>  Right.  How we do that still needs more experiments, but this is what
>should be done....
>
>Dan

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
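
As an illustration of the "simple calculation" between SHIFT_JIS and EUC mentioned above, here is a minimal Python sketch. It assumes only the two-byte JIS X 0208 part of each encoding (no JIS X 0201 half-width kana, no JIS X 0212, no vendor extensions), and the function names `euc_to_sjis` and `sjis_to_euc` are hypothetical, not taken from Encode or Jcode. EUC-JP is the JIS X 0208 code with 0x80 added to each byte. Shift_JIS packs two JIS rows into one lead byte and shifts the trail byte accordingly.

```python
def euc_to_sjis(e1, e2):
    """Convert one two-byte EUC-JP (JIS X 0208) character to Shift_JIS."""
    j1, j2 = e1 - 0x80, e2 - 0x80      # EUC-JP -> 7-bit JIS X 0208
    if j1 & 1:                         # odd row: first half of the trail range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:                 # skip 0x7F, which Shift_JIS never uses
            s2 += 1
    else:                              # even row: second half of the trail range
        s2 = j2 + 0x7E
    s1 = ((j1 - 0x21) >> 1) + 0x81     # two JIS rows share one lead byte
    if s1 > 0x9F:                      # jump over the half-width kana block
        s1 += 0x40
    return s1, s2


def sjis_to_euc(s1, s2):
    """Convert one two-byte Shift_JIS (JIS X 0208) character to EUC-JP."""
    if s1 >= 0xE0:                     # undo the jump over 0xA0-0xDF
        s1 -= 0x40
    j1 = (s1 - 0x81) * 2 + 0x21        # first (odd) row for this lead byte
    if s2 >= 0x9F:                     # trail byte belongs to the even row
        j1 += 1
        j2 = s2 - 0x7E
    else:
        if s2 >= 0x80:                 # undo the 0x7F skip
            s2 -= 1
        j2 = s2 - 0x1F
    return j1 + 0x80, j2 + 0x80        # 7-bit JIS -> EUC-JP
```

For example, `euc_to_sjis(0xA4, 0xA2)` gives `(0x82, 0xA0)`. Since no lookup table is involved, one JIS X 0208 table could in principle serve both encodings, which is the table reduction being suggested.
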