Hi, Vangelis =) Try:

perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'

perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'

2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi
>
> First, many thanks for the whole family of excellent LWP and HTML modules
> and the work invested in them.
>
> My question concerns decode_entities, Unicode and *some* HTML entities
> (the ones in the 128-255 chr() range).
>
> The manual says for decode_entities: "This routine replaces HTML entities
> found in the $string with the corresponding Unicode character"
>
> So I was expecting that if I decode the nbsp entity I would get the U+00A0
> character (in Perl, \x{A0}).
>
> I do:
> ================================================================
> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;";
> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>
> $VAR1 = '�';
> ================================================================
> On my terminal I see the replacement character (a black diamond with a
> question mark), whereas I would expect to see something like:
> $VAR1 = "\x{a0}";
>
> If I do the same with the euro entity:
> ================================================================
> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>
> $VAR1 = "\x{20ac}";
> ================================================================
> I do get the expected result (the Perl U+20AC Unicode character).
>
> Trying to dig a bit more, I noticed the following:
> ================================================================
> $ perl -e 'use HTML::Entities; $str = "&nbsp;";
> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
> 00000000  a0  |.|
> 00000001
>
> perl -e 'use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
> Wide character in print at -e line 1.
> 00000000  e2 82 ac  |...|
> 00000003
>
> perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); $t = Encode::encode("UTF-8",$str);
> print $t' | hexdump -C
> 00000000  e2 82 ac  |...|
> 00000003
> ================================================================
>
> In the nbsp case I get the byte 'a0', whereas I would expect the bytes
> 'c2 a0' (for UTF-8).
>
> In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper
> bytes for U+20AC in UTF-8. I also get a "Wide character in print" warning
> from print(), because the string isn't encoded properly.
>
> In the 2nd euro case I get the same bytes (correct U+20AC in UTF-8) and no
> warning from print(), since I do encode properly.
>
> So to rephrase my question: why don't I see "\x{a0}" (in the Perl string),
> or 'c2a0' in the bytes streamed, when I decode the nbsp HTML entity?
> Wouldn't these be the expected results?
>
> Regards
> Vangelis
>
> PS Forgive my ignorance if I say something stupid. I think I do understand
> some aspects of Unicode handling in Perl, but I haven't run out of room for
> improvement.
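
To spell out what the byte dumps in the quoted message suggest: the decoded string most likely already holds U+00A0, and the difference comes from how it is printed. Without an encoding layer on STDOUT, print writes a character below 256 as a single byte (hence 'a0'), while a character above 255 triggers the "Wide character" warning and goes out as UTF-8. A minimal sketch to check this, assuming HTML::Entities is installed; the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);

# In non-void context decode_entities returns a decoded copy.
my $str = decode_entities('&nbsp;');

# Inspect the character itself rather than how the terminal renders it.
printf "length=%d ord=U+%04X\n", length($str), ord($str);

# Encode explicitly before output to get the UTF-8 bytes.
my $bytes = encode('UTF-8', $str);
print unpack('H*', $bytes), "\n";
```

This should print "length=1 ord=U+00A0" followed by "c2a0", the same bytes the quoted message expected for the nbsp case.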