Paul Bijnens <[EMAIL PROTECTED]> writes:
>Can anyone explain what I'm doing wrong?

As I recall, HTML::Entities has a build-time option as to whether it
handles Unicode - do you know if yours has that turned on?
What locale are you in (i.e. is it something that has € as a native
8-bit coding, Windows 1251 or iso-8859-15 say)?

>I have this recurring problem of strings not being flagged
>as utf8, when -- I believe -- they should be.
>
>One of those cases is in decode_entities() from the module
>HTML::Entities, but I have other occurrences too (e.g. in Plucene).
>
>When I run this program:
>
>########### cut here
>#!/usr/bin/perl
>use HTML::Entities;
>use Encode;
>print "This is perl ", $], "\n";
>
>$s = "&euro;";
>$t = decode_entities($s);
>$u = decode("utf8", $t, Decode::FB_CROAK);
>
>print "t: ", Encode::is_utf8($t) ? "is" : "not", " utf8", "\n";
>print "u: ", Encode::is_utf8($u) ? "is" : "not", " utf8", "\n";
>print "t: ", ($t eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
>print "u: ", ($u eq "\x{20ac}") ? "is" : "not", " Eurosign\n";
>########### cut here
>
>I get this output:
>
>This is perl 5.008005
>t: not utf8
>u: is utf8
>t: not Eurosign
>u: is Eurosign
>
>I would expect that $t does have the utf8 flag set,
>as indicated in the manpage of HTML::Entities :
>
>    decode_entities( $string )
>        This routine replaces HTML entities found in the
>        $string with the corresponding ISO-8859-1 character,
>        and if possible (under perl 5.8 or later) will replace
>        to Unicode characters.  Unrecognized entities are left
>        alone.
>
>Why do I have to force the utf8 flag using decode("utf8",..) ?

Well, I agree that does suggest what you expect.

>
>One of my guesses is that the problem lies in XS-processing of strings
>where the utf8 flag is not set correctly.  True?

Certainly possible - I suggest you contact the author of HTML::Entities.
It is also possible it is left encoded deliberately.

>Why does nobody else complain then?
>
>Is my setup wrong?  (Tried this on different installations including
>a brand new Fedora Core 3...)
>
>
>--
>Paul Bijnens, Xplanation                            Tel  +32 16 397.511
>Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
>http://www.xplanation.com/          email:  [EMAIL PROTECTED]
>***********************************************************************
>* I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, F6,  *
>* quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye,     *
>* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup,   *
>* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown,   *
>* kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ...         *
>* ... "Are you sure?" ... YES ... Phew ... I'm out                    *
>***********************************************************************
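
To tell "left encoded" apart from "decoded to characters", a rough sketch along the following lines may help. It uses only the core Encode module and deliberately does not call HTML::Entities, so it says nothing about what your installed decode_entities() actually returns. The helper name describe() is made up for illustration. It compares the Euro sign as UTF-8 octets with the same sign as a Perl character string:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Hypothetical helper: report length, utf8 flag and equality with U+20AC.
sub describe {
    my ($name, $str) = @_;
    return sprintf "%s: length %d, %s utf8, %s Eurosign",
        $name, length($str),
        (is_utf8($str) ? "is" : "not"),
        ($str eq "\x{20ac}" ? "is" : "not");
}

# The Euro sign as UTF-8 encoded octets: three bytes, no utf8 flag.
my $bytes = encode("UTF-8", "\x{20ac}");

# Decode the octets back into a character string. FB_CROAK makes decode
# die on malformed input; LEAVE_SRC stops it from modifying $bytes in place.
my $decoded = decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

print describe("bytes", $bytes), "\n";
print describe("decoded", $decoded), "\n";
```

If your $t reports a length of 3 for a single Euro sign, it is holding encoded octets rather than one character, which would match the "not utf8 / not Eurosign" lines in your output.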