Re: Encode::Tcl Mistery Solved!

Nick Ing-Simmons Tue, 29 Jan 2002 04:36:02 -0800

Dan Kogai <[EMAIL PROTECTED]> writes:
>
>my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
>my $utf8_data;
>open my $fh, $utf8_file or die "$utf8_file:$!";


That is supposed to be :

open my $fh,"<:utf8", $utf8_file;

To tell perl that data is UTF-8.

>read $fh, $utf8_data, -s $utf8_file;
>close $fh;
>
>BEGIN { plan tests => 2 ; }
>ok(encode('euc-jp', $utf8_data), $euc_data);
>ok(decode('euc-jp', $euc_data), $utf8_data);
>__END__
>
>   Will it work?  NO!  It will fail like this.
>
>> not ok 1
>> # Test 1 got: <UNDEF> (t/classic.pl at line 24)
>> #   Expected: '0x0020:
>>  ...
>
>   You fed pre-certified data and still fails.  What's wrong?
>   The answer is:  $utf8 is no utf8 UNLESS YOU EXPLICITLY SPECIFY
>SOMEWHERE!

Yes - things are sequences of iso-8859-1 until told otherwise.
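
To make the difference concrete, here is a minimal sketch (not taken from the test suite) that reads the same file twice, once with no layer and once with ":utf8". The scratch file and the slurp() helper are made up for illustration; only open, length and File::Temp are standard.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for U+3042 (HIRAGANA LETTER A) to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE3\x81\x82";
close $out or die "close: $!";

# Hypothetical helper: read a whole file, optionally through an I/O layer.
sub slurp {
    my ($file, $layer) = @_;
    open my $fh, "<$layer", $file or die "$file: $!";
    local $/;                 # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh;
    return $data;
}

my $as_bytes = slurp($path, '');        # no layer: each byte is one character
my $as_chars = slurp($path, ':utf8');   # layer: the bytes become one character

printf "without layer: %d, with :utf8: %d\n",
    length($as_bytes), length($as_chars);
```

Without the layer perl sees three characters in the iso-8859-1 range; with it, perl sees a single character and marks the string as UTF-8.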

>insert
>
>       Encode::_utf8_on($utf8_data);
>
>   before ok() and now it works.  You can also make it work by replacing
>
>       open my $fh, $utf8_file or die "$utf8_file:$!";
>
>   to
>
>       open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";

Which is the preferred way.

>
>   Encoding engines themselves appears ok.
>   I repeat. ENCODING ENGINES THEMSELVES APPEARS OK!

I think we knew that ;-)

>   Am I dumb to take so long to find this out?  Maybe.  But code
>obfuscation, misleading error message and erroneous document is
>definitely also to blame.
>   If encode() demands an SV explicitly marked as UTF8, it should carp
>BEFORE it attempts to encode from the first place.

It doesn't. If it is not marked as UTF-8 it assumes it isn't. So
(Jarkko's locale stuff aside) it is a sequence of iso-8859-1 chars
for legacy compatibility. You then ask it to convert those bytes to
EUC-JP and lots of high-bit iso-8859-1's (which is what UTF8 encoded
stuff looks like) don't map so you get undefs.
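
A small sketch of that effect, assuming a reasonably recent Encode: the same three bytes are passed to encode() once as they are and once after being decoded to a character string. What exactly comes back for the unmappable characters depends on the Encode version and the check argument (undef in the test above, possibly substitution characters elsewhere), so the sketch only shows that the undecoded input does not yield the expected EUC-JP octets.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $octets = "\xE3\x81\x82";             # UTF-8 bytes for U+3042
my $chars  = decode('UTF-8', $octets);   # one character, marked as UTF-8

# Not marked as UTF-8: encode() sees three iso-8859-1 characters.
my $wrong = encode('euc-jp', $octets);

# Marked as UTF-8: encode() sees one Japanese character.
my $right = encode('euc-jp', $chars);

printf "decoded input encodes to: %s\n", unpack('H*', $right);
```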

Back to locale ... The idea of the locale stuff is to say "aha - user is
in a Japanese locale so in absence of instructions to the contrary I will
assume that files are full of iso2022-jp encoded stuff" (or whatever is
the right thing).
So you will still need to explicitly tell it when you are breaking
that assumption.

>   I also found croaking in (en|de)code is problematic in such occasion
>that you need to determine encodings dynamically.  With this in mind, I
>made changes to encode() and decode() as follows;
>
>sub encode
>{
>     my ($name,$string,$check) = @_;
>     my $enc = find_encoding($name);
>     unless (defined $enc){
>          # Maybe we should set $Encode::$! or something instead....
>          # or should we cast _utf8_on()?
>         carp("Unknown encoding '$name'");
>         return;
>     }
>     unless (is_utf8($string)){
>         $check += 0; # numify when empty
>         carp("¥$string is not UTF-8: encode('$name', ¥$string, $check)");

I assume that ESC sequences are iso2022 - this is also "the wrong thing".
Eventually carp is going to write to the STDERR stream and it may "know"
that STDERR is iso2022 and do the right thing.

>         return;
>     }
>     my $octets = $enc->encode($string,$check);
>     return undef if ($check && length($string));
>     return $octets;
>}
>
>sub decode
>{
>     my ($name,$octets,$check) = @_;
>     my $enc = find_encoding($name);
>     unless(defined $enc){
>         carp("Unknown encoding '$name'");
>         return;
>     }
>     my $string = $enc->decode($octets,$check);
>     $_[1] = $octets if $check;
>     return $string;
>}
>
>   There are other places that croak() where they should carp(), but I'll
>wait for the next bleadperl to commit these changes.

The idea of the croak is you can catch it silently with

eval { $string = decode($trythis,... }
(or better yet call find_encoding yourself before getting that far).
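
Roughly like this, as a sketch only. try_decode() and the candidate list are invented for illustration; find_encoding(), decode() and FB_CROAK are the real Encode interface as it exists in current releases.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: try candidate encoding names in order and return
# the first name that decodes the octets cleanly, plus the decoded string.
sub try_decode {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        # Check the name ourselves so an unknown encoding never croaks.
        next unless defined find_encoding($name);
        my $copy = $octets;   # decode() may modify its input when a check is set
        # FB_CROAK makes malformed input die; eval catches that silently.
        my $string = eval { decode($name, $copy, Encode::FB_CROAK()) };
        return ($name, $string) if defined $string;
    }
    return;                   # nothing matched
}

my ($name, $string) =
    try_decode("\xA4\xA2", 'no-such-encoding', 'UTF-8', 'euc-jp');
print "decoded as $name\n" if defined $name;
```

Nothing reaches STDERR here: the unknown name is skipped, the failed UTF-8 attempt is swallowed by eval, and the caller just gets the first encoding that fits.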

The carp is going to leak out to the user and look messy.

>   So much as I feel relieved now, I still feel uncomfortable on the API
>of Encode.  UTF8 flag must be explicitly set yet the use of _utf8_on()
>is deprecated.

Yes, you are supposed to set it on the file handle. Turning the flag on
by hand may be appropriate if data comes in magically from somewhere else.

>I am looking for a more elegant way to handle this....
>
>Dan the Man with too Many Charsets to Handle.
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
