Can anyone explain what I'm doing wrong? I have this recurring problem of strings not being flagged as utf8, when -- I believe -- they should be.
One of those cases is in decode_entities() from the module HTML::Entities, but I have other occurrences too (e.g. in Plucene).
When I run this program:
########### cut here
#!/usr/bin/perl
use HTML::Entities;
use Encode;

print "This is perl ", $], "\n";

$s = "€";
$t = decode_entities($s);
# The constant lives in Encode, not Decode; LEAVE_SRC keeps $t intact
# so that it can still be tested below.
$u = decode("utf8", $t, Encode::FB_CROAK | Encode::LEAVE_SRC);

print "t: ", Encode::is_utf8($t) ? "is" : "not", " utf8", "\n";
print "u: ", Encode::is_utf8($u) ? "is" : "not", " utf8", "\n";
print "t: ", ($t eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
print "u: ", ($u eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
########### cut here
I get this output:
    This is perl 5.008005
    t: not utf8
    u: is utf8
    t: not Eurosign
    u: is Eurosign
I would expect that $t does have the utf8 flag set, as indicated in the manpage of HTML::Entities:

    decode_entities( $string )
        This routine replaces HTML entities found in the $string with
        the corresponding ISO-8859-1 character, and if possible (under
        perl 5.8 or later) will replace to Unicode characters.
        Unrecognized entities are left alone.
Why do I have to force the utf8 flag using decode("utf8",..) ?
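For comparison, the same pair of checks can be sketched with Encode alone, leaving HTML::Entities out of the picture. This is only a minimal sketch: it assumes that the octets E2 82 AC (the UTF-8 encoding of U+20AC) stand in for whatever decode_entities() handed back, and the names $bytes and $chars are illustrative, not taken from the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Assumption: these three octets (UTF-8 for U+20AC, EURO SIGN) stand in
# for an unflagged string such as $t in the program above.
my $bytes = "\xE2\x82\xAC";

# decode() interprets the octets as UTF-8 and returns a character string.
# LEAVE_SRC keeps $bytes untouched so both strings can be inspected.
my $chars = decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

for my $pair (["bytes", $bytes], ["chars", $chars]) {
    my ($name, $str) = @$pair;
    printf "%s: length %d, %s utf8, %s Eurosign\n",
        $name, length($str),
        Encode::is_utf8($str) ? "is" : "not",
        $str eq "\x{20ac}" ? "is" : "not";
}
```

Here the undecoded string reports length 3 and "not utf8", while the decoded one reports length 1 and "is utf8". That is the same pattern as $t and $u above, which suggests that $t holds the UTF-8 octets of the character rather than the character itself.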
One of my guesses is that the problem lies in XS-processing of strings where the utf8 flag is not set correctly. True? Why does nobody else complain then?
Is my setup wrong? (I tried this on different installations, including a brand new Fedora Core 3...)
--
Paul Bijnens, Xplanation
http://www.xplanation.com/