tr/// and use encoding

Dan Kogai Thu, 03 Oct 2002 04:06:09 -0700

On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
> On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
>> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi 
>> wrote:
>> Both.  I think the operation needed is straight-forward.  When you get
>> tr[LHS][RHS], decode'em then
>> feed it to the naked tr// .
>
> Urk...  That means a dip into the toke.c, how the tr/// ranges are
> implemented is... tricky.  sv_recode_to_utf8() is needed somewhere...
> but I'm a little bit pressed for time right now.  I suggest you
> perlbug this and move the process to perl5-porters.  (Inaba Hiroto
> also might have insight on this; he's the tr///-with-Unicode sensei,
> really-- he practically implemented all of it.  And he might read
> *[gk]ana much better than me :-)


So now this thread is on perl5-porters.  Since this "undocumented (lack 
of) feature" has a very easy workaround, I have yet to perlbug it.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and 
regexes into UTF-8, so you can get the power of perl 5.8.0 even when 
your source code is in a text encoding other than UTF-8.  But tr/// 
does not embrace this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

   $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
               -------- -------- -------- --------

And you want perl to do the following:

   $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;

All you have to do is:

   use encoding 'euc-jp';
   # ....
   eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };

=over

=item chars in this example

   utf8     euc-jp   charnames::viacode()
   -----------------------------------------
   \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
   \x{3093} \xA4\xF3 HIRAGANA LETTER N
   \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
   \x{30f3} \xA5\xF3 KATAKANA LETTER N

=back
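
The "decode'em, then feed it to the naked tr//" idea can also be sketched without the pragma, by calling Encode directly.  This is only an illustration under assumptions: the variable names and the C<$hira2kata> closure are made up for this example, and the byte sequences are the EUC-JP endpoints from the table above.  Because tr/// builds its translation table at compile time, the operator is still compiled through a string eval.

```perl
use strict;
use warnings;
use Encode qw(decode);

# EUC-JP byte sequences for the four range endpoints (from the table above).
my @euc = ("\xA4\xA1", "\xA4\xF3", "\xA5\xA1", "\xA5\xF3");

# Decode each endpoint to one character and spell it as a \x{....} escape.
my ($h_lo, $h_hi, $k_lo, $k_hi) =
    map { sprintf '\x{%04x}', ord decode('euc-jp', $_) } @euc;

# Compile tr/// from the decoded endpoints; the closure name is hypothetical.
my $hira2kata = eval
    "sub { (my \$s = shift) =~ tr/${h_lo}-${h_hi}/${k_lo}-${k_hi}/; \$s }"
    or die $@;

# Two Hiragana letters (A, I) as EUC-JP bytes, decoded to characters first.
my $bytes = "\xA4\xA2\xA4\xA4";
my $kana  = decode('euc-jp', $bytes);
my $out   = $hira2kata->($kana);
```

Here C<$out> should hold the corresponding Katakana characters, since both ranges have the same length and tr/// maps them position by position.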

=head1 DISCUSSION

I found this when I was writing a CGI book and wanted form 
validation/correction.  The example above converts all Hiragana to 
Katakana, which is a common task in Japan.  Traditionally this kind of 
operation was done via jcode::tr() (require "jcode.pl";) or Jcode::tr() 
(use Jcode;).  But as of perl 5.6.0 you can use Japanese directly in 
regexes and tr/// -- so long as your script is in UTF-8.

With perl 5.8.0, the direct application of multibyte regexes was made 
possible via the C<use encoding> pragma.  The pragma applies its magic 
as follows.  Suppose you C<use encoding 'foo';>:

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to 
C<Encode::find_encoding('foo')>.  If 'foo' is an encoding supported by 
Encode, ${^ENCODING} is now a "transcoder" object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are 
first fed to ${^ENCODING}->decode().  So from perl's point of view, 
they are the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, 
":encoding(foo)";> are implicitly applied, so you can feed STDIN in 
encoding 'foo' and get STDOUT in encoding 'foo'.

=back
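
Spelled out by hand, the three steps above look roughly like the following.  This is a sketch using Encode directly rather than the pragma; 'euc-jp' stands in for 'foo', and C<$enc>, C<$bytes> and C<$literal> are names chosen for this example only.

```perl
use strict;
use warnings;
use Encode ();

# Step 0: look up the transcoder object (what ${^ENCODING} would hold).
my $enc = Encode::find_encoding('euc-jp')
    or die "encoding not supported by Encode";

# Step 1: decode the raw bytes of a literal into characters.
my $bytes   = "\xA4\xA2";             # HIRAGANA LETTER A in EUC-JP
my $literal = $enc->decode($bytes);   # now a single character

# Step 2: the I/O layers that the pragma applies implicitly.
binmode STDIN,  ':encoding(euc-jp)';
binmode STDOUT, ':encoding(euc-jp)';
```

The missing piece described in this post is that nothing equivalent to step 1 happens for the LHS and RHS of tr///.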

Very clever and powerful.  But step 1 is not applied to tr///.  qq{} is 
under the control of C<use encoding>, so eval qq{} works as expected.

Though the workaround is simple, easy and clever, it still leaves an 
inconsistency in how ${^ENCODING} gets used; it does indeed work on 
non-interpolated literals already.

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>
