Re: HTML::Entities and unicode

Vangelis Katsikaros Tue, 08 Jan 2013 05:31:18 -0800

Hi again

I checked Entities.pm, lines 223 and 230.

The entities hash is populated by chr(). In the chr() 128-255 range something doesn't seem to work well:


For the uuml entity (U+00FC):
===================================================
perl -e 'use Devel::Peek; $t = chr(252); Dump($t)'
SV = PV(0x12b9b78) at 0x12e08a0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x12db180 "\374"\0
  CUR = 1
  LEN = 8

perl -e 'use Devel::Peek; $t = "\x{fc}"; Dump($t)'
SV = PV(0xf01b78) at 0xf288a0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xf23180 "\374"\0
  CUR = 1
  LEN = 8
===================================================



whereas for the euro sign (U+20AC), everything works as expected.
===================================================
perl -e 'use Devel::Peek; $t = chr(8364); Dump($t)'
SV = PV(0x681b78) at 0x6a88a0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x6a3180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8

perl -e 'use Devel::Peek; $t = "\x{20ac}"; Dump($t)'
SV = PV(0x722b78) at 0x7498a0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x744180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8
===================================================

It seems this is related to Perl's "Unicode Bug": http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22 ?


However, I don't see chr() mentioned there in the affected methods.
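
A minimal sketch of how the "Unicode Bug" can show up with a value produced by chr(), assuming a perl without `use feature 'unicode_strings'` in scope. The variable names are only illustrative. chr() itself is not the affected operation here; the string it returns for 128-255 is simply stored without the UTF8 flag, and operations such as uc() then treat it differently from an upgraded copy of the same character.

```perl
use strict;
use warnings;

# chr(252) yields a one-character string stored as a single byte (no UTF8 flag).
my $native   = chr(252);
my $upgraded = chr(252);
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# Both variables hold the same character, U+00FC.
my $equal = ($native eq $upgraded) ? "yes" : "no";

# Without 'unicode_strings', uc() uses ASCII-only rules on the non-upgraded
# string and Unicode rules on the upgraded one.
my $uc_native   = uc $native;
my $uc_upgraded = uc $upgraded;

print  "equal before uc: $equal\n";
printf "uc(native)   = U+%04X\n", ord $uc_native;
printf "uc(upgraded) = U+%04X\n", ord $uc_upgraded;
```

Adding `use feature 'unicode_strings';` should make both uc() calls give the same result.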



Suggestion:

If the above makes sense, would it also make sense to change all the instances of chr() to Encode::decode(), at least the ones in the 128-255 (non-ASCII Latin-1) range?


-  'uuml;'     => chr(252),
+  'uuml;'     => Encode::decode("UTF-8", "\303\274"), # with octets
or

+  'uuml;'     => Encode::decode("UTF-8", "ü"), # with bytes, if the .pm file is saved as UTF-8. Messy, as it could have unexpected results on some systems?




===================================================

perl -e 'use Encode; use Devel::Peek; $t = "ü"; $t = Encode::decode("UTF-8", $t); Dump($t)'

SV = PV(0x1536b78) at 0x155d8d8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1565af0 "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 8

perl -e 'use Encode; use Devel::Peek; $t = "\303\274"; $t = Encode::decode("UTF-8", $t); Dump($t)'

SV = PV(0x1205b78) at 0x122c8e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1234b00 "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 8
===================================================
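
A small sketch comparing the two forms directly, using only core modules. The variable names are hypothetical. It suggests that chr(252) and the Encode::decode() result are the same one-character string as far as `eq` and length() are concerned, and differ only in the internal UTF8 flag that Devel::Peek displays.

```perl
use strict;
use warnings;
use Encode ();

my $from_chr    = chr(252);                             # as in the entity table
my $from_decode = Encode::decode("UTF-8", "\303\274");  # the proposed replacement

# Character-level comparison: same length, same code point.
my $same_chars = ($from_chr eq $from_decode) ? 1 : 0;

# Internal representation: only the decoded string carries the UTF8 flag.
my $flag_chr    = utf8::is_utf8($from_chr)    ? 1 : 0;
my $flag_decode = utf8::is_utf8($from_decode) ? 1 : 0;

print "same characters:  $same_chars\n";
print "UTF8 flag chr:    $flag_chr\n";
print "UTF8 flag decode: $flag_decode\n";
```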

Regards
Vangelis



On 01/08/2013 02:36 PM, Victor Efimov wrote:

So, sometimes it returns a UTF-8 flagged character string:

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
HTML::Entities; $str = "&euro;&nbsp;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0xd67b78) at 0xd95220
   REFCNT = 1
   FLAGS = (POK,pPOK,UTF8)
   PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
   CUR = 5
   LEN = 16

Sometimes an ISO-8859-1 byte string:

perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
); print Dump($str)'
SV = PV(0x12fcb78) at 0x132a200
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x131ab50 "\240"\0
   CUR = 1
   LEN = 8
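
The difference between the two dumps above can be reproduced without HTML::Entities. This is a sketch in which chr() stands in for the decoded entities (an assumption based on the Dump output in this thread): concatenating a character above U+00FF with chr(0xA0) upgrades the whole string, while chr(0xA0) alone stays in the one-byte form. The character itself is the same in both cases.

```perl
use strict;
use warnings;

my $nbsp  = chr(0xA0);                 # stand-in for decoding "&nbsp;"
my $mixed = chr(0x20AC) . chr(0xA0);   # stand-in for decoding "&euro;&nbsp;"

my $nbsp_flag  = utf8::is_utf8($nbsp)  ? 1 : 0;   # stored as the single byte \240
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;   # stored internally as UTF-8

# The second character of the mixed string still equals the lone nbsp.
my $same = (substr($mixed, 1, 1) eq $nbsp) ? 1 : 0;

print "nbsp flagged:  $nbsp_flag\n";
print "mixed flagged: $mixed_flag\n";
print "same nbsp:     $same\n";
```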


I think there is a corresponding bug report:
https://rt.cpan.org/Public/Bug/Display.html?id=73751



2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:

Hi Victor :)

Yes, this is definitely needed if I want to "see" the character in my console
properly. However, I am looking at the bytes too.

Indeed, Devel::Peek is a much better alternative to see things properly,
thanks!

================================================================
$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x13b5b78) at 0x13dc920
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x13d71d0 "\240"\0
   CUR = 1
   LEN = 8

$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x165fb78) at 0x1686920
   REFCNT = 1
   FLAGS = (POK,pPOK,UTF8)
   PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
   CUR = 3
   LEN = 8
================================================================

For the euro entity I see
"\342\202\254"\0 [UTF8 "\x{20ac}"]

but for the nbsp entity I see
"\240"\0

No [UTF8 "\x{a0}"]



Let me explain why I do expect U+00A0:

http://www.w3.org/TR/html4/sgml/entities.html

<quote>
<!ENTITY nbsp   CDATA "&#160;" -- no-break space = non-breaking space,
                                   U+00A0 ISOnum -->
</quote>


Regards
Vangelis


On 01/08/2013 02:01 PM, Victor Efimov wrote:


Hi, Vangelis =)

try

perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use
HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
); print Dumper($str)'
perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str =
"&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'

2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:


Hi

First, many thanks for the whole family of excellent LWP and HTML modules
and the work invested in them.



My question concerns decode_entities, unicode and *some* HTML entities
(the ones in the 128-255 chr() range).

The manual says for decode_entities "This routine replaces HTML entities
found in the $string with the corresponding Unicode character"

So I was expecting that if I decode the nbsp entity I would get the
U+00A0
character (in perl \x{A0})



I do:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
"&nbsp;";
HTML::Entities::decode_entities( $str ); print Dumper($str)'

$VAR1 = '�';
================================================================
I see on my terminal the replacement character - black diamond with
question
mark, whereas I would expect to see sth like :
$VAR1 = "\x{a0}";




If I do the same with the euro entity:
================================================================
perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str =
"&euro;";
HTML::Entities::decode_entities( $str ); print Dumper($str)'

$VAR1 = "\x{20ac}";
================================================================
I do get the expected result (the perl U+20AC unicode character)





Trying to dig a bit more I noticed the following:
================================================================
$ perl -e 'use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
00000000  a0                                                |.|
00000001

perl -e 'use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
Wide character in print at -e line 1.
00000000  e2 82 ac                                          |...|
00000003

perl -e 'use Encode; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); $t =
Encode::encode("UTF-8",$str);
print $t' | hexdump -C
00000000  e2 82 ac                                          |...|
00000003
================================================================

In the nbsp case I get the byte 'a0' whereas I would expect the bytes 'c2
a0' (for utf-8).

In the 1st euro case I do get the bytes 'e2 82 ac' that are the proper
bytes
for U+20AC in utf-8. I do get a "Wide character in print" warning from
print(), because the str isn't encoded properly.

In the 2nd euro case I get the same bytes (correct U+20AC in utf-8) and
no
warn message from print(), since I do encode properly.
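
A minimal sketch of the same explicit encoding step applied to the nbsp case, using only the core Encode module. Here chr(0xA0) is an assumed stand-in for the result of decoding "&nbsp;", and the variable names are hypothetical. Encoding the character string to UTF-8 before output yields the two octets c2 a0.

```perl
use strict;
use warnings;
use Encode ();

my $nbsp = chr(0xA0);      # stand-in for decode_entities("&nbsp;")
my $euro = chr(0x20AC);    # stand-in for decode_entities("&euro;")

# Encode the character strings into UTF-8 octets, then show them as hex.
my $nbsp_hex = unpack "H*", Encode::encode("UTF-8", $nbsp);
my $euro_hex = unpack "H*", Encode::encode("UTF-8", $euro);

print "nbsp as UTF-8: $nbsp_hex\n";
print "euro as UTF-8: $euro_hex\n";
```

An alternative with the same effect on output would be `binmode(STDOUT, ":encoding(UTF-8)")` before printing the character strings.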




So to rephrase my question: why don't I see "\x{a0}" (in the perl string),
or 'c2 a0' in the bytes streamed, when I decode the nbsp HTML entity?
Wouldn't these be the expected results?

Regards
Vangelis

PS Forgive my ignorance if I say sth stupid. I think I do understand some
aspects of unicode handling in perl, but I haven't run out of room for
improvement.
