On 2002.01.31, at 17:13, Nick Ing-Simmons wrote:
>> Now the problem is escape-based codings such as ISO-2022.
>
> Can you explain the way those work?
> I can imagine two ways for decode:
> A - keep going with current sub-encoding till we get a fail,
>     then look at next few octets for an escape sequence.
> B - Scan ahead for next escape sequence (or end of available input)
>     then translate up to that.
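[Editorial aside: method B is essentially a tokenizer -- find the next escape sequence, then hand everything before it to the converter for the current sub-encoding. A rough sketch, in Python rather than Perl for brevity, covering only the common ISO-2022-JP designators; the names are mine, not from any of the modules discussed below:]

```python
import re

# The common ISO-2022-JP designators: ESC $ @ / ESC $ B (JIS X 0208),
# ESC ( B (ASCII), ESC ( J (JIS X 0201 Roman), ESC ( I (JIS X 0201 kana).
ESC_RE = re.compile(rb'\x1b(?:\$[@B]|\([BJI])')

def segments(octets):
    """Yield (designator, chunk) pairs; the designator is None for
    any text before the first escape sequence."""
    pos, esc = 0, None
    for m in ESC_RE.finditer(octets):
        if m.start() > pos:
            yield esc, octets[pos:m.start()]
        esc, pos = m.group(), m.end()
    if pos < len(octets):
        yield esc, octets[pos:]
```

A decoder would then dispatch each chunk to the converter for the designated charset; scanning between escape boundaries like this is what lets jcode.pl (below) do the whole job with one substitution regex.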
To answer these questions, let's see what the existing utilities do.
Here I will discuss NKF, jcode.pl, and my humble Jcode.

NKF (Network Kanji Filter)
ftp://ftp.ie.u-ryukyu.ac.jp/pub/software/kono/
  * First appeared in 1987. Still maintained.
  * Handles EUC-JP, JIS (ISO-2022-JP) and SHIFT_JIS.
  * No Unicode support to date. This is understandable because the
    other "legacy" encodings need no conversion tables; they are all
    based upon JIS X 02xx.
  * Stream-based. No buffer allocation and such (this changed later
    when NKF.pm was added to the distribution, but even then NKF.xs
    just does buffer handling and nkf(1) does no in-memory conversion).
  * Uses method B for ISO-2022 (or my ungetc() !).

jcode.pl
ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
  * First appeared in 1992, BEFORE Perl 5.
  * Still maintained; still widely used, for the same reason
    cgi-lib.pl is used instead of CGI.pm.
  * Written 100% in perl.
  * No Unicode support.
  * Method C? Just like method B, but it uses a regex to grab the text
    between escape boundaries (see Jcode.pm below).

Jcode.pm
http://www.openlab.gr.jp/Jcode/
  * First appeared in 1999.
  * Unicode support added (XS and no-XS both supported).
  * Object model.
  * Internal routines based upon jcode.pl.
  * Here is the piece of Jcode that does JIS -> EUC conversion:

    sub jis_euc {
        my $thingy = shift;
        my $r_str  = ref $thingy ? $thingy : \$thingy;
        $$r_str =~ s(
            ($RE{JIS_0212}|$RE{JIS_0208}|$RE{JIS_ASC}|$RE{JIS_KANA})
            ([^\e]*)
        ){
            my ($esc, $str) = ($1, $2);
            if ($esc !~ /$RE{JIS_ASC}/o) {
                $str =~ tr/\x21-\x7e/\xa1-\xfe/;
                if ($esc =~ /$RE{JIS_KANA}/o) {
                    $str =~ s/([\xa1-\xdf])/\x8e$1/og;
                }
                elsif ($esc =~ /$RE{JIS_0212}/o) {
                    $str =~ s/([\xa1-\xfe][\xa1-\xfe])/\x8f$1/og;
                }
            }
            $str;
        }geox;
        $$r_str;
    }

> You could bundle several encodings in one XS (the way Encode itself
> bundles ASCII, iso-8859-* and koi8).

I know.
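[Editorial aside: once the regex has a segment in hand, the sub above does three things -- sets the high bit (the tr///), prefixes SS2 (0x8E) to each half-width kana byte, and prefixes SS3 (0x8F) to each JIS X 0212 pair. The same logic in Python; the function and charset names are mine, not Jcode's:]

```python
def jis_segment_to_euc(payload, charset):
    """Convert the bytes between two ISO-2022-JP escapes to EUC-JP.

    charset: 'ascii', 'kana' (JIS X 0201), '0208' or '0212'.
    """
    if charset == 'ascii':
        return payload
    # tr/\x21-\x7e/\xa1-\xfe/ -- set the high bit on each byte
    out = bytes(b + 0x80 if 0x21 <= b <= 0x7E else b for b in payload)
    if charset == 'kana':        # SS2 (0x8E) before each kana byte
        return b''.join(b'\x8e' + bytes([b]) for b in out)
    if charset == '0212':        # SS3 (0x8F) before each 2-byte char
        return b''.join(b'\x8f' + out[i:i+2] for i in range(0, len(out), 2))
    return out                   # JIS X 0208 needs the high bit only
```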
But it is bulky, and another problem is that Tcl has a different notion
of 'escape' (like euc_jp_0212, which is not exactly an escape but an
extension) which needs to be corrected for practical use.

> If any of the bundled encodings have similar sequences of code points
> then we will get overall table size reductions too.
>
> In the limit one could have Encode::CJK, but perhaps
> Encode::JP / Encode::CN / Encode::KR makes more sense ???

Right. From a user's point of view a distinct package namespace for
each (human) language is better. But again, this can be implemented
like:

  Encode::EUC  (does all EUC-based conversion)
  Encode::JP   (wrapper module that calls Encode::EUC and Encode::ISO2022)
  Encode::KR
  Encode::CN

and so forth. Actually even more table reduction can be done between
SHIFT_JIS and EUC. They are both based upon JIS X 0208 (and 0201 and
0212), so a simple calculation converts one to the other.

> We could ship things pre-compiled (with original .ucm's gzipped, or
> provide a way to extract a .ucm from the compiled form).
> Also the compile process is all in perl and has not really been tuned.
> It spends a lot of time trying to find common "strings" (which gets
> tables down in size so is a win.)

Right. Exactly how we do that still needs more experiments, but this is
what should be done....

Dan
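[Editorial aside on the "simple calculation" between SHIFT_JIS and EUC mentioned above: since both encode the same JIS X 0208 code points, a double-byte character converts with a few lines of arithmetic and no table at all. A sketch in Python -- this is the standard folding arithmetic, not Encode code:]

```python
def sjis_to_euc(s1, s2):
    """Convert one double-byte SHIFT_JIS character (JIS X 0208 only)
    to an EUC-JP byte pair, purely by arithmetic."""
    # Undo Shift_JIS row folding (two JIS rows per SJIS lead byte)
    # to recover the 7-bit JIS lead byte (0x21..0x7E).
    j1 = (s1 - (0x71 if s1 <= 0x9f else 0xb1)) * 2 + 1
    if s2 > 0x7f:        # skip the 0x7F gap in SJIS trail bytes
        s2 -= 1
    if s2 >= 0x9e:       # second JIS row of the folded pair
        j1 += 1
        j2 = s2 - 0x7d
    else:                # first JIS row of the folded pair
        j2 = s2 - 0x1f
    return j1 + 0x80, j2 + 0x80   # EUC-JP just sets the high bit
```

Running it on well-known code points: SJIS 0x88 0x9F maps to EUC 0xB0 0xA1, and SJIS 0x82 0xA0 maps to EUC 0xA4 0xA2.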