Re: HTML::Entities and unicode

Victor Efimov Tue, 08 Jan 2013 06:17:01 -0800

Hm, seems my previous comment was wrong.

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc7fb78) at 0xca35b0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xc9db30 "\240"\0
  CUR = 1
  LEN = 8


(byte string, ISO-8859-1, correct)
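
A minimal sketch of why that one-byte result is still the character U+00A0, using only core functions. Here chr(0xA0) stands in for the string that decode_entities produced in the dump above (an assumption, made so the snippet does not depend on HTML::Entities being installed), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for the decoded "&nbsp;": one character, stored as one byte.
my $nbsp = chr(0xA0);

my $flagged = utf8::is_utf8($nbsp) ? 1 : 0;   # internal storage flag only
my $length  = length $nbsp;                   # counted in characters
my $code    = ord $nbsp;                      # code point of that character

# Encoding explicitly for output yields the two UTF-8 octets c2 a0.
my $octets = Encode::encode("UTF-8", $nbsp);
my $hex    = join " ", map { sprintf "%02x", ord } split //, $octets;

printf "flag=%d length=%d code=U+%04X utf8=%s\n",
    $flagged, $length, $code, $hex;
```

The UTF8 flag only describes how the string is stored internally. The octets that reach a file or terminal depend on whether the string is encoded on output, which is why a plain print of this string emits the single byte a0.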

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x112bb78) at 0x114f5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1149b30 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str1 =
"&nbsp;"; HTML::Entities::decode_entities( $str1);  $str2 = "&euro;";
HTML::Entities::decode_entities( $str2); print Dump($str1.$str2)'
SV = PV(0x12a6b58) at 0x12ca588
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x12d3eb0 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;&euro;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x1ccbb78) at 0x1cef5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1ce9b30 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 16

(UTF-8, correct)


It looks correct: when a character string containing wide characters is
concatenated with a byte string, the byte string is treated as ISO-8859-1.

> http://perldoc.perl.org/perlunifaq.html
> What if I don't decode?
> Whenever your encoded, binary string is used together with a text string, 
> Perl will assume that your binary string was encoded with ISO-8859-1, also 
> known as latin-1
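
The concatenation behaviour can be checked without reading Devel::Peek output. This is a small sketch under the assumption that chr(0xA0) and chr(0x20AC) are equivalent to the two decoded entities from the dumps above; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Stand-ins for the two decoded entities (assumption: same values as above).
my $nbsp = chr(0xA0);     # stored as a single Latin-1 byte, no UTF8 flag
my $euro = chr(0x20AC);   # cannot fit in one byte, so it carries the UTF8 flag

my $joined = $nbsp . $euro;

# The byte string is upgraded as Latin-1, so U+00A0 survives intact.
my $chars   = length $joined;
my $first   = ord substr($joined, 0, 1);
my $second  = ord substr($joined, 1, 1);
my $same    = ($joined eq "\x{a0}\x{20ac}") ? 1 : 0;
my $flagged = utf8::is_utf8($joined) ? 1 : 0;

printf "chars=%d first=U+%04X second=U+%04X same=%d flag=%d\n",
    $chars, $first, $second, $same, $flagged;
```

If the byte had been misread as something other than Latin-1, the first code point would no longer be U+00A0.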


So it seems the internal representation of characters/bytes is correct in
all cases and compatible with text processing.
If you still want the same internal representation every time (to work around
http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22),
you can call utf8::upgrade:

$ perl -e 'use utf8; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc00b78) at 0xc245d8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xc1eb40 "\240"\0
  CUR = 1
  LEN = 8
$ perl -e 'use utf8; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str); utf8::upgrade($str);
print Dump($str)'
SV = PV(0x15cbb78) at 0x15ef5e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x15e5ad0 "\302\240"\0 [UTF8 "\x{a0}"]
  CUR = 2
  LEN = 3
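
To illustrate what the upgrade changes in practice, here is a hedged sketch of the "Unicode Bug" using U+00FC (the character behind the uuml entity). It assumes a script without "use feature 'unicode_strings'", without "use locale" and without the /u regex flag, since any of those changes the result; the variable names are illustrative.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" and no /u on the regex.

my $native   = chr(0xFC);   # U+00FC, one byte, no UTF8 flag
my $upgraded = chr(0xFC);
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 internally

# The two strings still compare equal as text.
my $equal = ($native eq $upgraded) ? 1 : 0;

# Character-class matching and case mapping depend on the storage format.
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;
my $uc_native     = ord uc $native;
my $uc_upgraded   = ord uc $upgraded;

printf "equal=%d word=%d/%d uc=U+%04X/U+%04X\n",
    $equal, $word_native, $word_upgraded, $uc_native, $uc_upgraded;
```

The strings are equal, yet only the upgraded one matches \w and upper-cases to U+00DC. That difference is the reason to upgrade strings whose characters fall in the 128-255 range before applying regexes or case functions to them.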


2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi again
>
> I checked Entities.pm, lines 223 and 230.
>
> The entities hash is populated by chr(). In the 128-255 chr() range something
> doesn't seem to work well:
>
> For the uuml entity (U+00FC):
> ===================================================
> perl -e 'use Devel::Peek; $t = chr(252); Dump($t)'
> SV = PV(0x12b9b78) at 0x12e08a0
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x12db180 "\374"\0
>
>   CUR = 1
>   LEN = 8
>
> perl -e 'use Devel::Peek; $t = "\x{fc}"; Dump($t)'
> SV = PV(0xf01b78) at 0xf288a0
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0xf23180 "\374"\0
>
>   CUR = 1
>   LEN = 8
> ===================================================
>
>
>
> whereas for the euro sign (U+20AC), all work as expected.
> ===================================================
> perl -e 'use Devel::Peek; $t = chr(8364); Dump($t)'
> SV = PV(0x681b78) at 0x6a88a0
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x6a3180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>
>   CUR = 3
>   LEN = 8
>
> perl -e 'use Devel::Peek; $t = "\x{20ac}"; Dump($t)'
> SV = PV(0x722b78) at 0x7498a0
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x744180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>
>   CUR = 3
>   LEN = 8
> ===================================================
>
>
> It seems this is related to perl's "Unicode Bug":
> http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22 ?
>
> However, I don't see chr() mentioned there in the affected methods.
>
>
>
> Suggestion:
>
> If the above makes sense, would it also make sense to change all the
> instances of chr() to Encode::decode(), at least the ones in the 128-255
> range?
>
> -  'uuml;'     => chr(252),
> +  'uuml;'     => Encode::decode("UTF-8", "\303\274"), # with octets
> or
> +  'uuml;'     => Encode::decode("UTF-8", "ü"),        # with bytes, if the
> pm file is saved as UTF-8. Messy, as it could give unexpected results on some
> systems?
>
>
>
> ===================================================
> perl -e 'use Encode; use Devel::Peek; $t = "ü"; $t = Encode::decode("UTF-8",
> $t); Dump($t)'
> SV = PV(0x1536b78) at 0x155d8d8
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x1565af0 "\303\274"\0 [UTF8 "\x{fc}"]
>   CUR = 2
>   LEN = 8
>
> perl -e 'use Encode; use Devel::Peek; $t = "\303\274"; $t =
> Encode::decode("UTF-8", $t); Dump($t)'
> SV = PV(0x1205b78) at 0x122c8e8
>
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x1234b00 "\303\274"\0 [UTF8 "\x{fc}"]
>   CUR = 2
>   LEN = 8
> ===================================================
>
> Regards
> Vangelis
>
>
>
>
> On 01/08/2013 02:36 PM, Victor Efimov wrote:
>>
>> So, sometimes it returns correct UTF-8 character string
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
>> HTML::Entities; $str = "&euro;&nbsp;";
>> HTML::Entities::decode_entities( $str ); print Dump($str)'
>> SV = PV(0xd67b78) at 0xd95220
>>    REFCNT = 1
>>    FLAGS = (POK,pPOK,UTF8)
>>    PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
>>    CUR = 5
>>    LEN = 16
>>
>> Sometimes ISO-8859-1 BYTE string
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
>> HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
>> ); print Dump($str)'
>> SV = PV(0x12fcb78) at 0x132a200
>>    REFCNT = 1
>>    FLAGS = (POK,pPOK)
>>    PV = 0x131ab50 "\240"\0
>>    CUR = 1
>>    LEN = 8
>>
>>
>> I think there is corresponding bug
>> https://rt.cpan.org/Public/Bug/Display.html?id=73751
>>
>>
>>
>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>
>>> Hi Victor :)
>>>
>>> Yes, this is definitely needed if I want to "see" the character properly in
>>> my console. However, I am looking at the bytes too.
>>>
>>> Indeed, Devel::Peek is a much better alternative to see things properly,
>>> thanks!
>>>
>>> ================================================================
>>> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
>>> "&nbsp;";
>>> HTML::Entities::decode_entities( $str ); print Dump($str)'
>>> SV = PV(0x13b5b78) at 0x13dc920
>>>    REFCNT = 1
>>>    FLAGS = (POK,pPOK)
>>>    PV = 0x13d71d0 "\240"\0
>>>    CUR = 1
>>>    LEN = 8
>>>
>>> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
>>> "&euro;";
>>> HTML::Entities::decode_entities( $str ); print Dump($str)'
>>> SV = PV(0x165fb78) at 0x1686920
>>>    REFCNT = 1
>>>    FLAGS = (POK,pPOK,UTF8)
>>>    PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>>>    CUR = 3
>>>    LEN = 8
>>> ================================================================
>>>
>>> For the euro entity I see
>>> "\342\202\254"\0 [UTF8 "\x{20ac}"]
>>>
>>> but for the nbsp entity I see
>>> "\240"\0
>>>
>>> No [UTF8 "\x{a0}"]
>>>
>>>
>>>
>>> Let me explain why I do expect U+00A0:
>>>
>>> http://www.w3.org/TR/html4/sgml/entities.html
>>>
>>> <quote>
>>> <!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
>>>                                    U+00A0 ISOnum -->
>>> </quote>
>>>
>>>
>>> Regards
>>> Vangelis
>>>
>>>
>>> On 01/08/2013 02:01 PM, Victor Efimov wrote:
>>>>
>>>>
>>>> Hi, Vangelis =)
>>>>
>>>> try
>>>>
>>>> perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use
>>>> HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
>>>> ); print Dumper($str)'
>>>> perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
>>>> "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>>>
>>>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>>>
>>>>>
>>>>> Hi
>>>>>
>>>>> First, many thanks for the whole family of excellent LWP and HTML modules
>>>>> and the work invested in them.
>>>>>
>>>>>
>>>>>
>>>>> My question concerns decode_entities, Unicode and *some* HTML entities
>>>>> (the ones in the 128-255 chr() range).
>>>>>
>>>>> The manual says for decode_entities: "This routine replaces HTML entities
>>>>> found in the $string with the corresponding Unicode character"
>>>>>
>>>>> So I was expecting that if I decode the nbsp entity I would get the U+00A0
>>>>> character (in perl \x{A0})
>>>>>
>>>>>
>>>>>
>>>>> I do:
>>>>> ================================================================
>>>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
>>>>> "&nbsp;";
>>>>> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>>>
>>>>> $VAR1 = '�';
>>>>> ================================================================
>>>>> On my terminal I see the replacement character (a black diamond with a
>>>>> question mark), whereas I would expect to see something like:
>>>>> $VAR1 = "\x{a0}";
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> If I do the same with the euro entity:
>>>>> ================================================================
>>>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
>>>>> "&euro;";
>>>>> HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>>>
>>>>> $VAR1 = "\x{20ac}";
>>>>> ================================================================
>>>>> I do get the expected result (the perl U+20AC unicode character)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Trying to dig a bit more I noticed the following:
>>>>> ================================================================
>>>>> $ perl -e 'use HTML::Entities; $str = "&nbsp;";
>>>>> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>>>> 00000000  a0                                                |.|
>>>>> 00000001
>>>>>
>>>>> perl -e 'use HTML::Entities; $str = "&euro;";
>>>>> HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>>>> Wide character in print at -e line 1.
>>>>> 00000000  e2 82 ac                                          |...|
>>>>> 00000003
>>>>>
>>>>> perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
>>>>> HTML::Entities::decode_entities( $str ); $t =
>>>>> Encode::encode("UTF-8",$str);
>>>>> print $t' | hexdump -C
>>>>> 00000000  e2 82 ac                                          |...|
>>>>> 00000003
>>>>> ================================================================
>>>>>
>>>>> In the nbsp case I get the byte 'a0' whereas I would expect the bytes
>>>>> 'c2 a0' (for UTF-8).
>>>>>
>>>>> In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper
>>>>> bytes for U+20AC in UTF-8. I also get a "Wide character in print" warning
>>>>> from print(), because the string isn't encoded before output.
>>>>>
>>>>> In the 2nd euro case I get the same bytes (correct U+20AC in UTF-8) and no
>>>>> warning from print(), since I do encode properly.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So to rephrase my question: why don't I see "\x{a0}" (in the perl string),
>>>>> or 'c2 a0' in the bytes streamed, when I decode the nbsp HTML entity?
>>>>> Wouldn't these be the expected results?
>>>>>
>>>>> Regards
>>>>> Vangelis
>>>>>
>>>>> PS Forgive my ignorance if I say something stupid. I think I do understand
>>>>> some aspects of Unicode handling in perl, but I haven't run out of room for
>>>>> improvement.
>>>>
>>>>
>>>>
>>>
>>
>
