So, sometimes it returns a correct UTF-8 character string:

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0xd67b78) at 0xd95220
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
  CUR = 5
  LEN = 16

Sometimes an ISO-8859-1 BYTE string:

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x12fcb78) at 0x132a200
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x131ab50 "\240"\0
  CUR = 1
  LEN = 8

I think there is a corresponding bug: https://rt.cpan.org/Public/Bug/Display.html?id=73751

2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi Victor :)
>
> Yes, this is definitely needed if I want to "see" the character in my
> console properly. However, I am looking at the bytes too.
>
> Indeed, Devel::Peek is a much better alternative to see things properly,
> thanks!
>
> ================================================================
> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
> SV = PV(0x13b5b78) at 0x13dc920
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x13d71d0 "\240"\0
>   CUR = 1
>   LEN = 8
>
> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
> SV = PV(0x165fb78) at 0x1686920
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>   CUR = 3
>   LEN = 8
> ================================================================
>
> For the euro entity I see
> "\342\202\254"\0 [UTF8 "\x{20ac}"]
>
> but for the nbsp entity I see
> "\240"\0
>
> No [UTF8 "\x{a0}"]
>
> Let me explain why I do expect U+00A0:
>
> http://www.w3.org/TR/html4/sgml/entities.html
>
> <quote>
> <!ENTITY nbsp CDATA "&#160;" -- no-break space = non-breaking space,
> U+00A0 ISOnum -->
> </quote>
>
> Regards
> Vangelis
>
> On 01/08/2013 02:01 PM, Victor Efimov wrote:
>>
>> Hi, Vangelis =)
>>
>> try
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>> perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>
>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>
>>> Hi
>>>
>>> First, many thanks for the whole family of excellent LWP and HTML
>>> modules and the work invested in them.
>>>
>>> My question concerns decode_entities, Unicode and *some* HTML entities
>>> (the ones in the 128-255 chr() range).
>>>
>>> The manual says for decode_entities: "This routine replaces HTML
>>> entities found in the $string with the corresponding Unicode character"
>>>
>>> So I was expecting that if I decode the nbsp entity I would get the
>>> U+00A0 character (in perl, \x{A0}).
>>>
>>> I do:
>>> ================================================================
>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>
>>> $VAR1 = '�';
>>> ================================================================
>>> I see on my terminal the replacement character (black diamond with
>>> question mark), whereas I would expect to see something like:
>>> $VAR1 = "\x{a0}";
>>>
>>> If I do the same with the euro entity:
>>> ================================================================
>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>
>>> $VAR1 = "\x{20ac}";
>>> ================================================================
>>> I do get the expected result (the perl U+20AC Unicode character).
>>>
>>> Trying to dig a bit more, I noticed the following:
>>> ================================================================
>>> $ perl -e 'use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>> 00000000  a0                                                |.|
>>> 00000001
>>>
>>> perl -e 'use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>> Wide character in print at -e line 1.
>>> 00000000  e2 82 ac                                          |...|
>>> 00000003
>>>
>>> perl -e 'use Encode; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); $t = Encode::encode("UTF-8",$str); print $t' | hexdump -C
>>> 00000000  e2 82 ac                                          |...|
>>> 00000003
>>> ================================================================
>>>
>>> In the nbsp case I get the byte 'a0', whereas I would expect the bytes
>>> 'c2 a0' (for UTF-8).
>>>
>>> In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper
>>> bytes for U+20AC in UTF-8. I do get a "Wide character in print" warning
>>> from print(), because the string isn't encoded properly.
>>>
>>> In the 2nd euro case I get the same bytes (correct U+20AC in UTF-8) and
>>> no warning from print(), since I do encode properly.
>>>
>>> So to rephrase my question: why don't I see "\x{a0}" (in the perl
>>> string), or 'c2a0' in the bytes streamed, when I decode the nbsp HTML
>>> entity? Wouldn't these be the expected results?
>>>
>>> Regards
>>> Vangelis
>>>
>>> PS Forgive my ignorance if I say something stupid. I think I do
>>> understand some aspects of Unicode handling in perl, but I haven't run
>>> out of room for improvement.