Hi Victor :)
Yes, this is definitely needed if I want to "see" the character in my
console properly. However, I am looking at the bytes too.
Indeed, Devel::Peek is a much better way to see things properly,
thanks!
================================================================
$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x13b5b78) at 0x13dc920
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x13d71d0 "\240"\0
CUR = 1
LEN = 8
$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
"&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x165fb78) at 0x1686920
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
CUR = 3
LEN = 8
================================================================
For the euro entity I see
"\342\202\254"\0 [UTF8 "\x{20ac}"]
but for the nbsp entity I see
"\240"\0
No [UTF8 "\x{a0}"]
Let me explain why I expect U+00A0:
http://www.w3.org/TR/html4/sgml/entities.html
<quote>
<!ENTITY nbsp CDATA "&#160;" -- no-break space = non-breaking space,
U+00A0 ISOnum -->
</quote>
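To check what the decoded string contains without relying on the internal UTF8 flag, something like the following could be used. This is only a minimal sketch: `describe_entity` is a hypothetical helper name of mine, not part of HTML::Entities, and it assumes Encode and HTML::Entities are installed.

```perl
use strict;
use warnings;
use Encode ();
use HTML::Entities ();

# Hypothetical helper: decode one entity and report properties that do not
# depend on how perl stores the string internally.
sub describe_entity {
    my ($entity) = @_;
    my $str = $entity;
    HTML::Entities::decode_entities($str);    # void context: decoded in place

    my $upgraded = $str;
    utf8::upgrade($upgraded);                 # changes internal storage only

    return {
        length             => length($str),
        codepoint          => sprintf("U+%04X", ord($str)),
        same_after_upgrade => ($str eq $upgraded ? 1 : 0),
        utf8_hex           => unpack("H*", Encode::encode("UTF-8", $str)),
    };
}

for my $entity ("&nbsp;", "&euro;") {
    my $info = describe_entity($entity);
    printf "%-7s length=%d codepoint=%s utf8=%s\n",
        $entity, @{$info}{qw(length codepoint utf8_hex)};
}
```

If my understanding is right, both entities should report length 1, with code points U+00A0 and U+20AC and UTF-8 bytes c2a0 and e282ac, whether or not Devel::Peek shows the UTF8 flag.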
Regards
Vangelis
On 01/08/2013 02:01 PM, Victor Efimov wrote:
Hi, Vangelis =)
try
perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use
HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
); print Dumper($str)'
perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
Hi
First, many thanks for the whole family of excellent LWP and HTML modules
and the work invested in them.
My question concerns decode_entities, Unicode and *some* HTML entities
(the ones in the chr() range 128-255).
The manual says for decode_entities: "This routine replaces HTML entities
found in the $string with the corresponding Unicode character".
So I was expecting that if I decode the nbsp entity I would get the U+00A0
character (in Perl, \x{A0}).
I do:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str ); print Dumper($str)'
$VAR1 = '�';
================================================================
On my terminal I see the replacement character (black diamond with question
mark), whereas I would expect to see something like:
$VAR1 = "\x{a0}";
If I do the same with the euro entity:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print Dumper($str)'
$VAR1 = "\x{20ac}";
================================================================
I do get the expected result (the Perl Unicode character U+20AC).
Trying to dig a bit more, I noticed the following:
================================================================
$ perl -e 'use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
00000000 a0 |.|
00000001
perl -e 'use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
Wide character in print at -e line 1.
00000000 e2 82 ac |...|
00000003
perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); $t = Encode::encode("UTF-8",$str);
print $t' | hexdump -C
00000000 e2 82 ac |...|
00000003
================================================================
In the nbsp case I get the byte 'a0', whereas I would expect the bytes 'c2
a0' (for UTF-8).
In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper bytes
for U+20AC in UTF-8. I also get a "Wide character in print" warning from
print(), because the string is not encoded before it is printed.
In the 2nd euro case I get the same bytes (correct U+20AC in UTF-8) and no
warning from print(), since I encode the string explicitly.
So, to rephrase my question: why don't I see "\x{a0}" (in the Perl string), or
'c2 a0' in the bytes streamed, when I decode the nbsp HTML entity? Wouldn't
these be the expected results?
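For comparison, here is a minimal sketch of printing the decoded strings through an explicit UTF-8 output layer instead of printing them raw. The helper name `bytes_written` and the in-memory filehandle are my own choices for illustration; on a real script the same layer could be put on STDOUT with binmode.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper: decode each entity, print it through a UTF-8 encoding
# layer, and return the bytes that were written as a hex string.
sub bytes_written {
    my (@entities) = @_;
    my $buffer = '';
    open(my $out, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    for my $entity (@entities) {
        my $str = $entity;
        HTML::Entities::decode_entities($str);    # void context: in place
        print {$out} $str;                        # the layer does the encoding
    }
    close($out) or die "cannot close in-memory handle: $!";
    return unpack("H*", $buffer);
}

print bytes_written("&nbsp;"), "\n";
print bytes_written("&euro;"), "\n";
```

With the layer in place I would expect c2a0 for nbsp and e282ac for euro, with no "Wide character" warning, because the encoding happens on output rather than depending on how the string is stored.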
Regards
Vangelis
PS Forgive my ignorance if I have said something stupid. I think I do
understand some aspects of Unicode handling in Perl, but I haven't run out of
room for improvement.