Re: HTML::Entities and unicode

Victor Efimov Tue, 08 Jan 2013 04:36:47 -0800

So sometimes it returns a proper character string (UTF8 flag set):

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
HTML::Entities; $str = "&euro;&nbsp;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0xd67b78) at 0xd95220
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
  CUR = 5
  LEN = 16


And sometimes an ISO-8859-1 byte string (UTF8 flag not set):

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
); print Dump($str)'
SV = PV(0x12fcb78) at 0x132a200
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x131ab50 "\240"\0
  CUR = 1
  LEN = 8


I think there is a corresponding bug report:
https://rt.cpan.org/Public/Bug/Display.html?id=73751



2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi Victor :)
>
> Yes, this is definitely needed if I want to "see" the character in my console
> properly. However, I am looking at the bytes too.
>
> Indeed, Devel::Peek is a much better way to see things properly,
> thanks!
>
> ================================================================
> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;";
> HTML::Entities::decode_entities( $str ); print Dump($str)'
> SV = PV(0x13b5b78) at 0x13dc920
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x13d71d0 "\240"\0
>   CUR = 1
>   LEN = 8
>
> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;";
> HTML::Entities::decode_entities( $str ); print Dump($str)'
> SV = PV(0x165fb78) at 0x1686920
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>   CUR = 3
>   LEN = 8
> ================================================================
>
> For the euro entity I see
> "\342\202\254"\0 [UTF8 "\x{20ac}"]
>
> but for the nbsp entity I see
> "\240"\0
>
> No [UTF8 "\x{a0}"]
>
>
>
> Let me explain why I do expect U+00A0:
>
> http://www.w3.org/TR/html4/sgml/entities.html
>
> <quote>
> <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
>                                   U+00A0 ISOnum -->
> </quote>
>
>
> Regards
> Vangelis
>
>
> On 01/08/2013 02:01 PM, Victor Efimov wrote:
>>
>> Hi, Vangelis =)
>>
>> try
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use
>> HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
>> ); print Dumper($str)'
>> perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
>> "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>
>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>
>>> Hi
>>>
>>> First, many thanks for the whole family of excellent LWP and HTML modules
>>> and the work invested in them.
>>>
>>>
>>>
>>> My question concerns decode_entities, Unicode and *some* HTML entities
>>> (the ones in the chr() range 128-255).
>>>
>>> The manual says for decode_entities "This routine replaces HTML entities
>>> found in the $string with the corresponding Unicode character"
>>>
>>> So I was expecting that if I decode the nbsp entity I would get the
>>> U+00A0 character (in Perl, \x{A0}).
>>>
>>>
>>>
>>> I do:
>>> ================================================================
>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
>>> "&nbsp;";
>>> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>
>>> $VAR1 = '�';
>>> ================================================================
>>> On my terminal I see the replacement character (a black diamond with a
>>> question mark), whereas I would expect to see something like:
>>> $VAR1 = "\x{a0}";
>>>
>>>
>>>
>>>
>>> If I do the same with the euro entity:
>>> ================================================================
>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
>>> "&euro;";
>>> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>
>>> $VAR1 = "\x{20ac}";
>>> ================================================================
>>> I do get the expected result (the perl U+20AC unicode character)
>>>
>>>
>>>
>>>
>>>
>>> Trying to dig a bit more I noticed the following:
>>> ================================================================
>>> $ perl -e 'use HTML::Entities; $str = "&nbsp;";
>>> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>> 00000000  a0                                                |.|
>>> 00000001
>>>
>>> perl -e 'use HTML::Entities; $str = "&euro;";
>>> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>> Wide character in print at -e line 1.
>>> 00000000  e2 82 ac                                          |...|
>>> 00000003
>>>
>>> perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
>>> HTML::Entities::decode_entities( $str ); $t =
>>> Encode::encode("UTF-8",$str);
>>> print $t' | hexdump -C
>>> 00000000  e2 82 ac                                          |...|
>>> 00000003
>>> ================================================================
>>>
>>> In the nbsp case I get the byte 'a0' whereas I would expect the bytes 'c2
>>> a0' (for utf-8).
>>>
>>> In the 1st euro case I do get the bytes 'e2 82 ac' that are the proper
>>> bytes
>>> for U+20AC in utf-8. I do get a "Wide character in print" warning from
>>> print(), because the str isn't encoded properly.
>>>
>>> In the 2nd euro case I get the same bytes (correct U+20AC in utf-8) and
>>> no
>>> warn message from print(), since I do encode properly.
>>>
>>>
>>>
>>>
>>> So to rephrase my question: why don't I see "\x{a0}" (in the Perl string),
>>> or 'c2 a0' in the bytes streamed, when I decode the nbsp HTML entity?
>>> Wouldn't these be the expected results?
>>>
>>> Regards
>>> Vangelis
>>>
>>> PS Forgive my ignorance if I say something stupid. I think I do understand some
>>> aspects of Unicode handling in Perl, but I haven't run out of room for
>>> improvement.
>>
>>
>
