Encode::Tcl Mystery Solved!

Dan Kogai Mon, 28 Jan 2002 20:52:29 -0800

Folks,

   I think I have finally solved the mystery of why Encode::Tcl's decode() 
works while encode() does not.  It was quite simple after all.
   First, please take a look at this code.  Both table.euc and table.utf8 
are guaranteed to be valid.


#!/path/to/perl5.7.2
use strict;
use Test;
use Encode;
use Encode::Tcl;

my $euc_file = "t/table.euc";   # Valid EUC-JP text file
my $euc_data;
open my $fh, $euc_file or die "$euc_file:$!";
read $fh, $euc_data, -s $euc_file;
close $fh;

my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
my $utf8_data;
open my $fh, $utf8_file or die "$utf8_file:$!";
read $fh, $utf8_data, -s $utf8_file;
close $fh;

BEGIN { plan tests => 2 ; }
ok(encode('euc-jp', $utf8_data), $euc_data);
ok(decode('euc-jp', $euc_data), $utf8_data);
__END__

   Will it work?  NO!  It will fail like this.

> not ok 1
> # Test 1 got: <UNDEF> (t/classic.pl at line 24)
> #   Expected: '0x0020:
>  ...

   You feed it pre-certified data and it still fails.  What's wrong?
   The answer is:  $utf8_data is not UTF-8 UNLESS YOU EXPLICITLY SAY SO 
SOMEWHERE!
   Insert

        Encode::_utf8_on($utf8_data);

   before ok() and now it works.  You can also make it work by replacing

        open my $fh, $utf8_file or die "$utf8_file:$!";

   with

        open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";
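
   To see the flag at work without the test files, here is a minimal 
sketch.  It assumes a current Encode release in which 'UTF-8' and 
'euc-jp' are built in (not Encode::Tcl as shipped with 5.7.2), and the 
three octets are a stand-in for the contents of t/table.utf8.  It uses 
decode() rather than _utf8_on() to get a flagged string, which is one 
possible approach and not necessarily what the test above should do.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Octets of the UTF-8 encoding of U+3042 (HIRAGANA LETTER A), as a plain
# read() on a filehandle opened without any layer would deliver them.
my $octets = "\xE3\x81\x82";

# Raw octets carry no UTF8 flag, so Perl treats them as three bytes.
my $flag_before = is_utf8($octets) ? 1 : 0;
my $len_before  = length $octets;

# decode() returns a character string with the UTF8 flag turned on.
my $chars      = decode('UTF-8', $octets);
my $flag_after = is_utf8($chars) ? 1 : 0;
my $len_after  = length $chars;

# With a flagged character string, encode() can produce EUC-JP octets.
my $euc = encode('euc-jp', $chars);

printf "before: flag=%d length=%d\n", $flag_before, $len_before;
printf "after:  flag=%d length=%d\n", $flag_after,  $len_after;
printf "euc-jp: %s\n", unpack('H*', $euc);
```

   The same string goes from three "characters" to one once Perl knows 
it is UTF-8, which is the difference the failing test was tripping over.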

   The encoding engines themselves appear to be OK.
   I repeat. THE ENCODING ENGINES THEMSELVES APPEAR TO BE OK!
   Am I dumb to take so long to find this out?  Maybe.  But code 
obfuscation, misleading error messages and erroneous documentation are 
definitely also to blame.
   If encode() demands an SV explicitly marked as UTF8, it should carp 
BEFORE it attempts to encode in the first place.
   I also found that croaking in (en|de)code is problematic on occasions 
where you need to determine encodings dynamically.  With this in mind, I 
made changes to encode() and decode() as follows:

sub encode
{
     my ($name,$string,$check) = @_;
     my $enc = find_encoding($name);
     unless (defined $enc){
           # Maybe we should set $Encode::$! or something instead....
           # or should we cast _utf8_on()?
         carp("Unknown encoding '$name'");
         return;
     }
     unless (is_utf8($string)){
         $check += 0; # numify when empty
         carp("\$string is not UTF-8: encode('$name', \$string, $check)");
         return;
     }
     my $octets = $enc->encode($string,$check);
     return undef if ($check && length($string));
     return $octets;
}

sub decode
{
     my ($name,$octets,$check) = @_;
     my $enc = find_encoding($name);
     unless(defined $enc){
         carp("Unknown encoding '$name'");
         return;
     }
     my $string = $enc->decode($octets,$check);
     $_[1] = $octets if $check;
     return $string;
}
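
   As an illustration of the "determine encodings dynamically" case, 
here is a rough sketch of a caller that tries several candidate 
encodings in turn.  guess_decode() is a hypothetical helper of my own 
naming, not part of Encode, and the sketch assumes a current Encode 
release where find_encoding() returns undef for an unknown name and 
FB_CROAK makes a failed decode die.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return (encoding name, decoded string) for the
# first candidate that decodes $octets cleanly, or an empty list.
sub guess_decode {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        # An unknown encoding name is skipped instead of being fatal.
        my $enc = find_encoding($name) or next;
        # Work on a copy: decoding with a CHECK value may modify its argument.
        my $copy   = $octets;
        my $string = eval { $enc->decode($copy, Encode::FB_CROAK()) };
        return ($enc->name, $string) if defined $string;
    }
    return;
}

# 0xA4 0xA2 is not valid UTF-8 but is valid EUC-JP (HIRAGANA LETTER A).
my ($name, $string) =
    guess_decode("\xA4\xA2", 'no-such-encoding', 'UTF-8', 'euc-jp');

printf "decoded as %s: U+%04X\n", $name, ord $string;
```

   The eval is only there because a failed strict decode dies; with 
functions that return undef on failure, the loop would not need it.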

   There are other places that croak() where they should carp(), but 
I'll wait for the next bleadperl to commit these changes.
   Much as I feel relieved now, I still feel uncomfortable with the API 
of Encode.  The UTF8 flag must be set explicitly, yet the use of 
_utf8_on() is deprecated.  I am looking for a more elegant way to 
handle this....

Dan the Man with too Many Charsets to Handle.
