Hm, seems my previous comment was wrong.

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc7fb78) at 0xca35b0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xc9db30 "\240"\0
  CUR = 1
  LEN = 8

(byte string, ISO-8859-1, correct)

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x112bb78) at 0x114f5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1149b30 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str1 = "&nbsp;"; HTML::Entities::decode_entities( $str1); $str2 = "&euro;"; HTML::Entities::decode_entities( $str2); print Dump($str1.$str2)'
SV = PV(0x12a6b58) at 0x12ca588
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x12d3eb0 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x1ccbb78) at 0x1cef5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1ce9b30 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 16

(UTF-8, correct)

This looks correct: when we concatenate a character string containing wide
characters with a byte string, the byte string is treated as ISO-8859-1.

> http://perldoc.perl.org/perlunifaq.html
> What if I don't decode?
> Whenever your encoded, binary string is used together with a text string,
> Perl will assume that your binary string was encoded with ISO-8859-1, also
> known as latin-1

So it seems the internal representation of characters/bytes is correct in all
cases and compatible with text processing.
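
To illustrate the point above, here is a minimal sketch. It is my own example,
not code from HTML::Entities, and the variable names are made up. It assumes
only core Perl plus Encode, and shows that a byte string holding \xA0 and its
upgraded copy are the same string at the Perl level, and that both give the
octets c2 a0 once you explicitly encode them for output:

```perl
use strict;
use warnings;
use Encode ();

# Two copies of the one-character string U+00A0 (hypothetical names).
my $bytes = "\xA0";        # stored internally as the single byte 0xA0
my $wide  = "\xA0";
utf8::upgrade($wide);      # stored internally as 0xC2 0xA0, UTF8 flag on

# The internal flag differs between the two copies...
printf "flag: bytes=%d wide=%d\n",
    utf8::is_utf8($bytes) ? 1 : 0,
    utf8::is_utf8($wide)  ? 1 : 0;

# ...but at the Perl level they are the same one-character string.
printf "equal: %s\n", $bytes eq $wide ? "yes" : "no";
printf "length: %d %d\n", length($bytes), length($wide);

# Concatenating with a wide character upgrades the byte string as Latin-1,
# so the result holds the code points U+00A0 and U+20AC.
my $joined = $bytes . "\x{20ac}";
print "joined: ", join(" ", map { sprintf "U+%04X", ord } split //, $joined), "\n";

# Explicit encoding for output gives the same octets for both copies.
printf "octets: %s %s\n",
    unpack("H*", Encode::encode("UTF-8", $bytes)),
    unpack("H*", Encode::encode("UTF-8", $wide));
```

So the single byte a0 in the hexdump comes from printing without an encoding
layer or an explicit Encode::encode, not from decode_entities returning the
wrong character.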

If you still want the string to have the same internal representation in every
case (to work around
http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22), you can use
utf8::upgrade:

$ perl -e 'use utf8; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc00b78) at 0xc245d8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xc1eb40 "\240"\0
  CUR = 1
  LEN = 8

$ perl -e 'use utf8; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str); utf8::upgrade($str); print Dump($str)'
SV = PV(0x15cbb78) at 0x15ef5e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x15e5ad0 "\302\240"\0 [UTF8 "\x{a0}"]
  CUR = 2
  LEN = 3

2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
> Hi again
>
> I checked Entities.pm, lines 230 and 223.
>
> The entities hash is populated by chr(). In the chr() 128-255 range
> something doesn't seem to work well:
>
> For the uuml entity (U+00FC):
> ===================================================
> perl -e 'use Devel::Peek; $t = chr(252); Dump($t)'
> SV = PV(0x12b9b78) at 0x12e08a0
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x12db180 "\374"\0
>   CUR = 1
>   LEN = 8
>
> perl -e 'use Devel::Peek; $t = "\x{fc}"; Dump($t)'
> SV = PV(0xf01b78) at 0xf288a0
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0xf23180 "\374"\0
>   CUR = 1
>   LEN = 8
> ===================================================
>
> whereas for the euro sign (U+20AC), everything works as expected.
> ===================================================
> perl -e 'use Devel::Peek; $t = chr(8364); Dump($t)'
> SV = PV(0x681b78) at 0x6a88a0
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x6a3180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>   CUR = 3
>   LEN = 8
>
> perl -e 'use Devel::Peek; $t = "\x{20ac}"; Dump($t)'
> SV = PV(0x722b78) at 0x7498a0
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x744180 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>   CUR = 3
>   LEN = 8
> ===================================================
>
> It seems this is related to perl's "Unicode Bug":
> http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22 ?
>
> However, I don't see chr() mentioned there among the affected functions.
>
> Suggestion:
>
> If the above makes sense, would it also make sense to change all the
> instances of chr() to Encode::decode(), at least the ones in the 128-255
> range?
>
> - 'uuml;' => chr(252),
> + 'uuml;' => Encode::decode("UTF-8", "\303\274"), # with octets
> or
> + 'uuml;' => Encode::decode("UTF-8", "ü"), # with bytes, if the pm file is
> saved as utf8. Messy, as it would have unexpected results on some systems?
>
> ===================================================
> perl -e 'use Encode; use Devel::Peek; $t = "ü"; $t = Encode::decode("UTF-8", $t); Dump($t)'
> SV = PV(0x1536b78) at 0x155d8d8
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x1565af0 "\303\274"\0 [UTF8 "\x{fc}"]
>   CUR = 2
>   LEN = 8
>
> perl -e 'use Encode; use Devel::Peek; $t = "\303\274"; $t = Encode::decode("UTF-8", $t); Dump($t)'
> SV = PV(0x1205b78) at 0x122c8e8
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x1234b00 "\303\274"\0 [UTF8 "\x{fc}"]
>   CUR = 2
>   LEN = 8
> ===================================================
>
> Regards
> Vangelis
>
> On 01/08/2013 02:36 PM, Victor Efimov wrote:
>>
>> So, sometimes it returns a correct UTF-8 character string
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>> SV = PV(0xd67b78) at 0xd95220
>>   REFCNT = 1
>>   FLAGS = (POK,pPOK,UTF8)
>>   PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
>>   CUR = 5
>>   LEN = 16
>>
>> Sometimes an ISO-8859-1 BYTE string
>>
>> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>> SV = PV(0x12fcb78) at 0x132a200
>>   REFCNT = 1
>>   FLAGS = (POK,pPOK)
>>   PV = 0x131ab50 "\240"\0
>>   CUR = 1
>>   LEN = 8
>>
>> I think there is a corresponding bug:
>> https://rt.cpan.org/Public/Bug/Display.html?id=73751
>>
>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>
>>> Hi Victor :)
>>>
>>> Yes, this is definitely needed if I want to "see" the character in my
>>> console properly. However, I am looking at the bytes too.
>>>
>>> Indeed, Devel::Peek is a much better alternative to see things properly,
>>> thanks!
>>>
>>> ================================================================
>>> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>> SV = PV(0x13b5b78) at 0x13dc920
>>>   REFCNT = 1
>>>   FLAGS = (POK,pPOK)
>>>   PV = 0x13d71d0 "\240"\0
>>>   CUR = 1
>>>   LEN = 8
>>>
>>> $ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>> SV = PV(0x165fb78) at 0x1686920
>>>   REFCNT = 1
>>>   FLAGS = (POK,pPOK,UTF8)
>>>   PV = 0x16811d0 "\342\202\254"\0 [UTF8 "\x{20ac}"]
>>>   CUR = 3
>>>   LEN = 8
>>> ================================================================
>>>
>>> For the euro entity I see
>>> "\342\202\254"\0 [UTF8 "\x{20ac}"]
>>>
>>> but for the nbsp entity I see
>>> "\240"\0
>>>
>>> No [UTF8 "\x{a0}"]
>>>
>>> Let me explain why I do expect U+00A0:
>>>
>>> http://www.w3.org/TR/html4/sgml/entities.html
>>>
>>> <quote>
>>> <!ENTITY nbsp CDATA "&#160;" -- no-break space = non-breaking space,
>>> U+00A0 ISOnum -->
>>> </quote>
>>>
>>> Regards
>>> Vangelis
>>>
>>> On 01/08/2013 02:01 PM, Victor Efimov wrote:
>>>>
>>>> Hi, Vangelis =)
>>>>
>>>> try
>>>>
>>>> perl -e 'use open qw/:std :utf8/; use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>> perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
>>>>
>>>> 2013/1/8 Vangelis Katsikaros <ibo...@yahoo.gr>:
>>>>>
>>>>> Hi
>>>>>
>>>>> First, many thanks for the whole family of excellent LWP and HTML
>>>>> modules and the work invested in them.
>>>>>
>>>>> My question concerns decode_entities, Unicode and *some* HTML entities
>>>>> (the ones in the 128-255 chr() range).
>>>>>
>>>>> The manual says for decode_entities: "This routine replaces HTML
>>>>> entities found in the $string with the corresponding Unicode character"
>>>>>
>>>>> So I was expecting that if I decode the nbsp entity I would get the
>>>>> U+00A0 character (in perl \x{A0})
>>>>>
>>>>> I do:
>>>>> ================================================================
>>>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>>>
>>>>> $VAR1 = '�';
>>>>> ================================================================
>>>>> I see on my terminal the replacement character (a black diamond with a
>>>>> question mark), whereas I would expect to see something like:
>>>>> $VAR1 = "\x{a0}";
>>>>>
>>>>> If I do the same with the euro entity:
>>>>> ================================================================
>>>>> perl -e 'use Encode; use Data::Dumper; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dumper($str)'
>>>>>
>>>>> $VAR1 = "\x{20ac}";
>>>>> ================================================================
>>>>> I do get the expected result (the perl U+20AC unicode character)
>>>>>
>>>>> Trying to dig a bit more I noticed the following:
>>>>> ================================================================
>>>>> $ perl -e 'use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>>>> 00000000 a0 |.|
>>>>> 00000001
>>>>>
>>>>> perl -e 'use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print $str' | hexdump -C
>>>>> Wide character in print at -e line 1.
>>>>> 00000000 e2 82 ac |...|
>>>>> 00000003
>>>>>
>>>>> perl -e 'use Encode; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); $t = Encode::encode("UTF-8",$str); print $t' | hexdump -C
>>>>> 00000000 e2 82 ac |...|
>>>>> 00000003
>>>>> ================================================================
>>>>>
>>>>> In the nbsp case I get the byte 'a0', whereas I would expect the bytes
>>>>> 'c2 a0' (for utf-8).
>>>>>
>>>>> In the 1st euro case I do get the bytes 'e2 82 ac', which are the proper
>>>>> bytes for U+20AC in utf-8. I do get a "Wide character in print" warning
>>>>> from print(), because the string isn't encoded properly.
>>>>>
>>>>> In the 2nd euro case I get the same bytes (correct U+20AC in utf-8) and
>>>>> no warning from print(), since I do encode properly.
>>>>>
>>>>> So to rephrase my question: why don't I see "\x{a0}" (in the perl
>>>>> string), or 'c2a0' in the bytes streamed, when I decode the nbsp HTML
>>>>> entity? Wouldn't these be the expected results?
>>>>>
>>>>> Regards
>>>>> Vangelis
>>>>>
>>>>> PS Forgive my ignorance if I say something stupid. I think I do
>>>>> understand some aspects of unicode handling in perl, but I haven't run
>>>>> out of room for improvement.