Re: HTML::Entities and unicode

Victor Efimov Tue, 08 Jan 2013 04:01:58 -0800

Hi, Vangelis =)

try


perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use
HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
); print Dumper($str)'
perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'

2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi
>
> First, many thanks for the whole family of excellent LWP and HTML modules and
> the work invested in them.
>
>
>
> My question concerns decode_entities, Unicode and *some* HTML entities
> (the ones in the 128-255 chr() range).
>
> The manual says for decode_entities "This routine replaces HTML entities
> found in the $string with the corresponding Unicode character"
>
> So I was expecting that if I decode the nbsp entity I would get the U+00A0
> character (in perl \x{A0})
>
>
>
> I do:
> ================================================================
> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;";
> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>
> $VAR1 = '�';
> ================================================================
> On my terminal I see the replacement character (a black diamond with a
> question mark), whereas I would expect to see something like:
> $VAR1 = "\x{a0}";
>
>
>
>
> If I do the same with the euro entity:
> ================================================================
> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>
> $VAR1 = "\x{20ac}";
> ================================================================
> I do get the expected result (the perl U+20AC unicode character)
>
>
>
>
>
> Trying to dig a bit more I noticed the following:
> ================================================================
> $ perl -e 'use HTML::Entities; $str = "&nbsp;";
> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
> 00000000  a0                                                |.|
> 00000001
>
> perl -e 'use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
> Wide character in print at -e line 1.
> 00000000  e2 82 ac                                          |...|
> 00000003
>
> perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); $t = Encode::encode("UTF-8",$str);
> print $t' | hexdump -C
> 00000000  e2 82 ac                                          |...|
> 00000003
> ================================================================
>
> In the nbsp case I get the byte 'a0' whereas I would expect the bytes 'c2
> a0' (for utf-8).
>
> In the 1st euro case I do get the bytes 'e2 82 ac' that are the proper bytes
> for U+20AC in utf-8. I do get a "Wide character in print" warning from
> print(), because the str isn't encoded properly.
>
> In the 2nd euro case I get the same bytes (correct U+20AC in utf-8) and no
> warn message from print(), since I do encode properly.
>
>
>
>
> So to rephrase my question: why don't I see "\x{a0}" (in the Perl string), or
> 'c2a0' in the bytes streamed, when I decode the nbsp HTML entity? Wouldn't
> these be the expected results?
>
> Regards
> Vangelis
>
> PS Forgive my ignorance if I say something stupid. I think I do understand
> some aspects of Unicode handling in Perl, but I haven't run out of room for
> improvement.
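
The two observations in the quoted message (a lone a0 byte for nbsp, but e2 82 ac for euro) can probably be reconciled by separating the character the string holds from the bytes that print() writes. Below is a minimal sketch, assuming HTML::Entities is installed. The variable names are mine, not from the thread. It checks that the decoded string is a single U+00A0 character and then produces the UTF-8 bytes in two ways: by encoding explicitly, and by putting an encoding layer on STDOUT.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);

# Decode in place, as in the one-liners above (void context modifies $str).
my $str = "&nbsp;";
decode_entities($str);

# The result is one character with code point U+00A0,
# regardless of how Perl happens to store it internally.
printf "length=%d ord=U+%04X\n", length($str), ord($str);

# Encoding explicitly turns that character into UTF-8 bytes.
my $bytes = encode("UTF-8", $str);
printf "utf8 bytes: %s\n", unpack("H*", $bytes);   # hex dump of the bytes

# Alternative: let the output handle do the encoding.
binmode(STDOUT, ":encoding(UTF-8)");
print $str, "\n";
```

Without an encoding layer, print() writes characters below 256 as single bytes, which would explain the bare a0 in the hexdump. Characters above 255, such as the euro sign, cannot be written that way, so Perl falls back to UTF-8 and emits the "Wide character" warning.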
