On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
> On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
>> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi
>> wrote:
>> Both.  I think the operation needed is straight-forward.  When you get
>> tr[LHS][RHS], decode'em then
>> feed it to the naked tr// .
>
> Urk... That means a dip into the toke.c, how the tr/// ranges are
> implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
> but I'm a little bit pressed for time right now.  I suggest you
> perlbug this and move the process to perl5-porters.  (Inaba Hiroto
> also might have insight on this; he's the tr///-with-Unicode sensei,
> really-- he practically implemented all of it.  And he might read
> *[gk]ana much better than me :-)

So now this thread is on perl5-porters.  Since this "undocumented (lack of)
feature" has a very easy workaround, I have yet to perlbug it.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and regexes
into UTF-8, so you can get the power of perl 5.8.0 even when your source
code is in a text encoding other than UTF-8.  But tr/// does not embrace
this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

  $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
              -------- -------- -------- --------

And you want perl to do the following:

  $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

All you have to do is:

  use encoding 'euc-jp';
  # ....
  eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };

=over

=item chars in this example

  utf8     euc-jp   charnames::viacode()
  -----------------------------------------
  \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
  \x{3093} \xA4\xF3 HIRAGANA LETTER N
  \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
  \x{30f3} \xA5\xF3 KATAKANA LETTER N

=back

=head1 DISCUSSION

I found this when I was writing a CGI book and wanted form
validation/correction.  The example above converts all Hiragana to
Katakana, which is a common task in Japan.  Traditionally this kind of
operation was done via jcode::tr() (require "jcode.pl";) or Jcode::tr()
(use Jcode;).  But as of perl 5.6.0 you can use Japanese directly in
regexes and tr///, so long as your script is in UTF-8.  With perl 5.8.0,
the direct use of multibyte regexes in other encodings was made possible
via the C<use encoding> pragma.

The C<use encoding> pragma applies its magic as follows.  Suppose you
C<use encoding 'foo';>:

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to
C<Encode::find_encoding('foo')>.  If 'foo' is an encoding supported by
Encode, ${^ENCODING} is now a "transcoder" object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are
first fed to ${^ENCODING}->decode().
So from perl's point of view, they are the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, ":encoding(foo)";>
are implicitly applied, so you can feed STDIN in encoding 'foo' and get
STDOUT in encoding 'foo'.

=back

Very clever and powerful.  But step 1 is not applied to tr///.  qq{} is
under the control of C<use encoding>, so eval qq{} works as expected.

Though the workaround is simple, easy and clever, it still leaves an
inconsistency in how ${^ENCODING} gets used: it does indeed already work
on non-interpolated literals.

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>
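
=head1 EXAMPLE

The idea from the quoted exchange ("decode'em, then feed it to the naked
tr//") can be imitated at run time without the pragma.  The following is
only a minimal sketch, not how perl or C<use encoding> does it internally:
it calls Encode::decode() directly, re-spells each decoded character as an
ASCII \x{...} escape, and builds the tr/// with a string eval.  The helper
name decoded_tr_operand() and the sample value of $kana are hypothetical,
chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode or perl): decode a tr/// operand
# from the given encoding and re-spell every character as a \x{...} escape,
# leaving the range hyphen untouched.
sub decoded_tr_operand {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    my $out   = '';
    for my $ch (split //, $chars) {
        $out .= $ch eq '-' ? '-' : sprintf('\x{%04x}', ord $ch);
    }
    return $out;
}

# The EUC-JP byte ranges from the WORKAROUND section.
my $lhs = decoded_tr_operand('euc-jp', "\xA4\xA1-\xA4\xF3");  # \x{3041}-\x{3093}
my $rhs = decoded_tr_operand('euc-jp', "\xA5\xA1-\xA5\xF3");  # \x{30a1}-\x{30f3}

# Assumed sample input: HIRAGANA LETTER A, I, U.
my $kana = "\x{3042}\x{3044}\x{3046}";

# tr/// does not interpolate variables, so the operator is built as a string.
eval "\$kana =~ tr/$lhs/$rhs/; 1" or die $@;

# Print the resulting code points: U+30A2 U+30A4 U+30A6 (KATAKANA A, I, U).
print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $kana), "\n";
```

Spelling the operands as escapes keeps the eval string pure ASCII, so the
sketch does not depend on how a string eval treats wide characters.  It only
handles simple ranges like the ones above; operands containing the tr///
delimiter or a literal hyphen would need extra care.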