Dan Kogai <[EMAIL PROTECTED]> writes:
>
>my $utf8_file = "t/table.utf8"; # Valid UTF8 text file
>my $utf8_data;
>open my $fh, $utf8_file or die "$utf8_file:$!";

That is supposed to be:

  open my $fh, "<:utf8", $utf8_file;

to tell perl that the data is UTF-8.

>read $fh, $utf8_data, -s $utf8_file;
>close $fh;
>
>BEGIN { plan tests => 2 ; }
>ok(encode('euc-jp', $utf8_data), $euc_data);
>ok(decode('euc-jp', $euc_data), $utf8_data);
>__END__
>
>  Will it work?  NO!  It will fail like this.
>
>> not ok 1
>> # Test 1 got: <UNDEF> (t/classic.pl at line 24)
>> #   Expected: '0x0020:
>> ...
>
>  You fed pre-certified data and it still fails.  What's wrong?
>  The answer is: $utf8 is no utf8 UNLESS YOU EXPLICITLY SPECIFY
>SOMEWHERE!

Yes - things are sequences of iso-8859-1 until told otherwise.

>insert
>
>  Encode::_utf8_on($utf8_data);
>
>  before ok() and now it works.  You can also make it work by replacing
>
>  open my $fh, $utf8_file or die "$utf8_file:$!";
>
>  to
>
>  open my $fh, "<:utf8", $utf8_file or die "$utf8_file:$!";

Which is the preferred way.

>
>  Encoding engines themselves appear ok.
>  I repeat.  ENCODING ENGINES THEMSELVES APPEAR OK!

I think we knew that ;-)

>  Am I dumb to take so long to find this out?  Maybe.  But code
>obfuscation, misleading error messages and erroneous documentation are
>definitely also to blame.
>  If encode() demands an SV explicitly marked as UTF8, it should carp
>BEFORE it attempts to encode in the first place.

It doesn't demand that. If the string is not marked as UTF-8, encode()
assumes it isn't UTF-8. So (Jarkko's locale stuff aside) it is a sequence
of iso-8859-1 chars, for legacy compatibility. You then ask it to convert
those chars to EUC-JP, and lots of high-bit iso-8859-1 chars (which is what
UTF8 encoded stuff looks like) don't map, so you get undefs.

Back to locale ...
The idea of the locale stuff is to say "aha - user is in a Japanese locale,
so in the absence of instructions to the contrary I will assume that files
are full of iso2022-jp encoded stuff" (or whatever is the right thing).
So you will still need to tell it explicitly when you are breaking that
assumption.
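
To make the point above concrete, here is a minimal sketch, assuming the data is held in memory rather than read from t/table.utf8. The three octets stand in for one line of that file (they are the UTF-8 form of U+3042, chosen only for illustration). Calling decode('UTF-8', ...) is used here as an alternative to the "<:utf8" layer for data that has already been read as raw octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Illustrative data: the UTF-8 octets of U+3042 (HIRAGANA LETTER A),
# as they would arrive from a file opened without any layer.
my $octets = "\xE3\x81\x82";

# Not marked as UTF-8, so perl treats this as three iso-8859-1 characters.
my $flag_before = is_utf8($octets) ? 1 : 0;
my $len_before  = length $octets;

# Decoding turns the octets into one character and marks the string as UTF-8.
my $chars = decode('UTF-8', $octets);

# Now encode() has real characters to map into EUC-JP.
my $euc  = encode('euc-jp', $chars);
my $back = decode('euc-jp', $euc);

printf "before: flag=%d length=%d\n", $flag_before, $len_before;
printf "after:  length=%d euc-jp=%s\n", length($chars), unpack('H*', $euc);
```

The same effect on a file handle comes from opening it with "<:utf8" as described above; the explicit decode() is only a convenience when the octets come from somewhere other than a handle you opened yourself.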

>  I also found croaking in (en|de)code is problematic on occasions
>when you need to determine encodings dynamically.  With this in mind, I
>made changes to encode() and decode() as follows;
>
>sub encode
>{
>    my ($name,$string,$check) = @_;
>    my $enc = find_encoding($name);
>    unless (defined $enc){
>        # Maybe we should set $Encode::$! or something instead....
>        # or should we cast _utf8_on()?
>        carp("Unknown encoding '$name'");
>        return;
>    }
>    unless (is_utf8($string)){
>        $check += 0; # numify when empty
>        carp("¥$string is not UTF-8: encode('$name', ¥$string, $check)");

I assume that ESC sequences are iso2022 - this is also "the wrong thing".
Eventually carp is going to write to the STDERR stream, and it may "know"
that STDERR is iso2022 and do the right thing.

>        return;
>    }
>    my $octets = $enc->encode($string,$check);
>    return undef if ($check && length($string));
>    return $octets;
>}
>
>sub decode
>{
>    my ($name,$octets,$check) = @_;
>    my $enc = find_encoding($name);
>    unless(defined $enc){
>        carp("Unknown encoding '$name'");
>        return;
>    }
>    my $string = $enc->decode($octets,$check);
>    $_[1] = $octets if $check;
>    return $string;
>}
>
>  There are other places that croak() where they should carp(), but I'll
>wait for the next bleadperl to commit these changes.

The idea of the croak is that you can catch it silently with

  eval { $string = decode($trythis,... }

(or better yet, call find_encoding yourself before getting that far).
The carp is going to leak out to the user and look messy.

>  So much as I feel relieved now, I still feel uncomfortable with the API
>of Encode.  The UTF8 flag must be explicitly set, yet the use of _utf8_on()
>is deprecated.

Yes - you are supposed to set it on the file handle. Turning the flag on
by hand may be appropriate if data comes in magically from somewhere else.

>I am looking for a more elegant way to handle this....
>
>Dan the Man with too Many Charsets to Handle.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
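
P.S. The eval approach mentioned above, for picking an encoding dynamically, can be sketched roughly as follows. The candidate list and the sample octets are hypothetical, chosen only for illustration; find_encoding(), decode() and Encode::FB_CROAK are the existing Encode interfaces. decode() is given a copy of the data because a CHECK argument may modify the source string.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical input: the EUC-JP octets of U+3042.
my $octets = "\xA4\xA2";

# An unknown name croaks inside decode(); eval catches it quietly.
my $result = eval { decode('no-such-encoding', $octets) };
my $error  = $@;

# Better: ask find_encoding() first, then trap malformed data with eval.
my @candidates = ('no-such-encoding', 'euc-jp');   # hypothetical list
my ($string, $used);
for my $name (@candidates) {
    next unless defined find_encoding($name);      # skip unknown names
    my $copy = $octets;                            # CHECK may alter its source
    $string = eval { decode($name, $copy, Encode::FB_CROAK) };
    if (defined $string) {
        $used = $name;
        last;
    }
}

print "decoded with $used\n" if defined $used;
```

Nothing leaks to STDERR here; the caller decides what, if anything, to report.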