Dan Kogai <[EMAIL PROTECTED]> writes:
>Hi jhi,
>
>  My name is Dan Kogai. I am the author of Jcode.pm, which converts
>between various Japanese charsets. With the advent of the Encode module
>that comes with Perl 5.7.2 and up, I finally thought the role of Jcode
>was over and Jcode could rest in peace. When I tested the module,
>however, I found it was far from it. Rather, I believe I can help a
>great deal with the current implementation.

Excellent ! ;-)
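(For readers who have not met Jcode.pm: its functional interface is a
single call that converts a byte string between Japanese encodings. A
minimal sketch, using the same Jcode::convert call that appears in the
benchmark later in this mail; the slurp of table.euc is just for
illustration:)

    use Jcode;

    # slurp the EUC-JP test table used throughout this mail
    open my $fh, 'table.euc' or die "table.euc:$!";
    my $eucstr = do { local $/; <$fh> };

    # convert EUC-JP octets to UTF-8 octets; the third argument names
    # the source encoding (Jcode tries to guess it when omitted)
    my $utf8 = Jcode::convert($eucstr, 'utf8', 'euc');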
>Problem #1: Where is the rest of the charsets!?
>
>  When perl5.7.2 gets installed, it installs a bunch of .enc files under
>Encoding/, including good old euc-jp. But when you
>
>  perl5.7.2 -MEncode -e 'print join(",", encodings), "\n";'
>
>you get
>
>koi8-r,dingbats,iso-8859-10,iso-8859-13,cp37,iso-8859-9,iso-8859-6,
>iso-8859-1,cp1047,iso-8859-4,Internal,iso-8859-2,symbol,iso-8859-3,
>US-ascii,iso-8859-8,iso-8859-14,UCS-2,iso-8859-5,UTF-8,iso-8859-7,
>iso-8859-15,cp1250,iso-8859-16,posix-bc
>
>  Those are 8-bit charsets only.

That was a deliberate decision on my part. Including "all" the ASCII-oid
8-bit encodings in their "compiled" form does not use much memory (as they
share at least half the space for the ASCII part). The compiled forms of
the multi-byte and two-byte encodings are larger. So I envisage
-MEncode=japanese (say) to load clusters.

>  I was at first disappointed, but I thought it over and found the
>Encode::Tcl module, which comes with no documentation. I read the code
>over and over and finally found
>
>  perl5.7.2 -MEncode -MEncode::Tcl -e 'print join(",", encodings), "\n";'
>
>  That gave me
>
>gb1988,cp857,macUkraine,dingbats,iso2022-jp,iso-8859-10,ksc5601,
>iso-8859-13,iso-8859-6,macTurkish,Internal,symbol,macJapan,iso2022,
>cp1250,posix-bc,cp1251,koi8-r,7bit-kr,cp437,cp866,iso-8859-3,cp874,
>iso-8859-8,macCyrillic,UCS-2,shiftjis,UTF-8,euc-jp,cp862,7bit-kana,
>cp861,cp860,macCroatian,jis0208,cp1254,cp37,iso-8859-9,7bit-jis,
>macGreek,big5,cp852,cp869,macCentEuro,iso-8859-1,cp1047,cp863,
>macIceland,macRoman,euc-kr,gsm0338,cp775,cp950,cp1253,cp424,cp856,
>cp850,iso-8859-16,cp1256,cp737,cp1252,macDingbats,jis0212,iso2022-kr,
>cp1006,euc-cn,cp949,cp855,gb2312,cp1255,iso-8859-4,iso-8859-2,cp1258,
>jis0201,cp864,US-ascii,cp936,iso-8859-14,iso-8859-5,iso-8859-7,
>iso-8859-15,cp865,macThai,HZ,macRomania,cp1257,gb12345,cp932

Encode::Tcl is SADAHIRO Tomoyuki's fixup/enhancement of the pure perl
version we used for a while before I invented the compiled form. The
Tcl-oid version is slow. The .enc files are lifted straight from Tcl;
it is unclear to me where the mappings come from. Modern Encode has C
code that processes a compiled form, and can compile ICU-like .ucm files
as well as .enc. The ICU form can represent fallbacks and non-reversible
stuff as well. At that point in the coding it became unclear whether we
could use ICU stuff - I think we have since concluded that we can.

>  And I smiled, and then wrote test code as follows.
>
>Problem #2: Does it really work?
>
>  So here is code #1, which encodes or decodes depending on the option.
>
>#!/usr/local/bin/perl5.7.2
>
>use strict;
>use Encode;
>use Encode::Tcl;
>
>my ($which, $from, $to) = @ARGV;
>my $op = $which =~ /e/ ? \&encode :
>         $which =~ /d/ ? \&decode : die "$0 [-[e|d]c] from to\n";
>my $check = $which =~ /c/;
>$check and warn "check set.\n";
>
>open my $in, "<$from" or die "$from:$!";
>open my $out, ">$to" or die "$to:$!";
>
>while (defined(my $line = <$in>)) {
>    use bytes;

File IO of encoded or UTF-8 data is very, very messy prior to perl5.7.
At best 'use bytes' is a hack.

>    # or print complains as follows:
>    # Wide character in print at ./classic.pl line 15, <$in> line 260.
>    print $out $op->('euc-jp', $line, $check);
>}
>__END__
>
>  It APPEARS to (en|de)code chars -- with lots of problems.
>  I fed it Jcode/t/table.euc, the file that contains all the characters
>defined in JIS X 0201 and JIS X 0208. Jcode tests itself by converting
>that file and back: if the (en|de)coder is OK, euc-jp -> utf8 -> euc-jp
>must give the original characters back.
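That round trip can be spelled directly against Encode's functional
interface. A minimal sketch (assuming table.euc sits in the current
directory; decode/encode are the same calls used in code #1 above):

    #!/usr/local/bin/perl5.7.2
    use strict;
    use Encode;
    use Encode::Tcl;    # needed for euc-jp, as noted above

    # slurp the EUC-JP test table as raw octets
    open my $in, '<', 'table.euc' or die "table.euc:$!";
    binmode $in;
    my $euc = do { local $/; <$in> };

    my $chars = decode('euc-jp', $euc);    # octets -> perl characters
    my $back  = encode('euc-jp', $chars);  # characters -> octets again

    print $euc eq $back ? "round trip OK\n" : "round trip FAILED\n";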
>  In the case of code #1 above, it did not. Many of the characters did
>appear converted, though. Emacs failed to auto-recognize the character
>encoding, but when I fed the resulting files to JEdit with the character
>set explicitly specified, the converted characters showed up.
>
>  Then I also tried this one.
>
>#!/usr/local/bin/perl5.7.2
>
>use strict;
>use Encode;
>use Encode::Tcl;
>
>my ($which, $from, $to) = @ARGV;
>my ($icode, $ocode);
>if ($which =~ /e/) {
>    $icode = "utf8"; $ocode = "encoding('euc-jp')";
>} elsif ($which =~ /d/) {
>    $icode = "encoding('euc-jp')"; $ocode = "utf8";
>} else {
>    die "$0 -[e|d] from to\n";
>}
>
>open my $in, "<:$icode", $from or die "$from:$!";
>open my $out, ">:$ocode", $to or die "$to:$!";
>
>while (defined(my $line = <$in>)) {
>    use bytes;

     ^^^^^^^^^
Catastrophic I would guess. 'use bytes' says "I know exactly what I am
doing", and so even though perl knows better it believes you and fails
to UTF-8-ify things etc.

>    print $out $line;
>}
>__END__
>
>  A new style. It does convert, but converts differently from the
>previous code. Also, this
>
>>Cannot find encoding "'euc-jp'" at ./newway.pl line 17.
>>:Invalid argument.
>
>appears for some reason.
>  I can only say Encode is far from production level as far as the
>Japanese charsets are concerned.

I would agree. It would be good to have some test data in various
encodings. This is easy for 8-bit encodings: 0..255 is all you need.
But for 16-bit encodings (with gaps), and in particular multi-byte
encodings, you need a "sensible" starting sample.

>Problem #3: How about performance?
>
>  It's silly to talk about performance before the code runs right in
>the first place, but I could not help checking it out.
>  Encode::Tcl implements conversion by filling a lookup table on the
>fly. That's what Jcode::Unicode::NoXS does too (well, mine uses a
>lookup hash, though). How's the performance? I naturally benchmarked.
>
>#!/usr/local/bin/perl5.7.2
>
>use Benchmark;
>use Encode;
>use Encode::Tcl;
>use Jcode;
>
>my $count = $ARGV[0] || 1;
>
>my $eucstr;    # shared with the benchmarked subs below
>sub subread {
>    open my $fh, 'table.euc' or die "table.euc:$!";
>    read $fh, $eucstr, -s 'table.euc';    # -s gives the size in bytes
>    close $fh;
>}
>subread();
>
>timethese($count,
>    {
>        "Encode::Tcl" =>
>            sub { my $decoded = decode('euc-jp', $eucstr, 1) },
>        "Jcode" =>
>            sub { my $decoded = Jcode::convert($eucstr, 'utf8', 'euc') },
>    }
>);
>__END__
>
>And here is the result.
>
>Benchmark: timing 1 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.28 usr +  0.00 sys =  0.28 CPU) @ 3.57/s (n=1)
>             (warning: too few iterations for a reliable count)
>      Jcode:  0 wallclock secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50.00/s (n=1)
>             (warning: too few iterations for a reliable count)
>Benchmark: timing 100 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.32 usr +  0.00 sys =  0.32 CPU) @ 312.50/s (n=100)
>             (warning: too few iterations for a reliable count)
>      Jcode:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @ 3333.33/s (n=100)
>             (warning: too few iterations for a reliable count)
>Benchmark: timing 1000 iterations of Encode::Tcl, Jcode...
>Encode::Tcl:  1 wallclock secs ( 0.38 usr +  0.00 sys =  0.38 CPU) @ 2631.58/s (n=1000)
>             (warning: too few iterations for a reliable count)
>      Jcode:  1 wallclock secs ( 0.11 usr +  0.00 sys =  0.11 CPU) @ 9090.91/s (n=1000)
>             (warning: too few iterations for a reliable count)
>
>  Just as I guessed. The first invocation of Encode::Tcl is way slow
>because it has to fill the lookup table. It gets faster as time goes by.
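An aside on the second script above: the "Cannot find encoding
"'euc-jp'"" failure looks like it comes from the literal single quotes
inside $icode/$ocode being passed through as part of the encoding name.
With the layer spelled plainly, the script would read as follows (a
sketch only, not verified against 5.7.2's PerlIO):

    #!/usr/local/bin/perl5.7.2
    use strict;

    my ($from, $to) = @ARGV;

    # the encoding name goes into the layer directly, with no extra quotes
    open my $in,  "<:encoding(euc-jp)", $from or die "$from:$!";
    open my $out, ">:utf8",             $to   or die "$to:$!";

    # with the layers doing the conversion there is no need for
    # 'use bytes'; lines arrive as perl characters and leave re-encoded
    while (defined(my $line = <$in>)) {
        print $out $line;
    }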
>  The current implementation of Jcode (with XS) also suffers a
>performance problem on utf8, because it first converts the chars to
>UCS-2 and then to UTF-8.
>
>#4: Conclusion
>
>  I think I have grokked both fully enough to implement
>Encode::Japanese. I know you don't grok Japanese very well (which you
>don't have to; I don't grok Finnish either :). It takes more than a
>simple table lookup to handle Japanese well enough to make native
>grokkers happy. It has to detect automatically which of the many
>charsets is in use, it has to be robust, and most of all, it must be
>documented in Japanese :) I can do all that.
>  I believe Jcode must someday cease to exist as the Camel starts to
>grok Japanese. With the Encode module that day is sooner than I
>expected, and I want to help you make my day.
>  If I submit Encode::Japanese, are you going to merge it as a standard
>module?

I encourage you to look at Encode/encengine.c - it is a state machine
which reads tables to transform octet sequences. It is a lot faster than
the Encode::Tcl scheme. I _think_ Encode/compile (which builds the
tables) does the right thing for multi-byte and 16-bit encodings, but as
I have no reliable test data, viewer, or judgement of the end result, I
cannot be sure.

What I would like to see is:

A. A review of Encode's APIs and principles, to make sure I have not
   done anything really stupid. Both the API from a perl script's
   perspective and the API/mechanism that it expects an Encoding
   "plugin" to provide.

B. "Blessing" of the Xxxxx <-> Unicode mappings for the various
   encodings. Are Tcl's "good enough", or should we use ICU's, or
   Unicode's, or ... ?

C. Point me at "clusters" of related encodings that are often used
   together, and I can have a crack at building a "compiled" XS module
   that provides those encodings.

D. Some discussion of how to handle escape encodings and/or heuristics
   for guessing an encoding. I had some 1/4-thought-out ideas for how to
   get encengine.c to assist with these too - but I have probably
   forgotten them.

>
>Dan the Man with Too Many Charsets to Deal With

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/