Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
Hi Gregor,

On Friday 5 June 2015, 17:21:18, gregor herrmann wrote:
> In this case I'd probably try with use utf8::all; or tell open() about
> the encoding:
>
>     $ cat test.pl
>     #!/usr/bin/perl
>     use utf8;
>     use HTML::Entities;
>     # was: open(INPUT, "testdata");
>     open(my $fh, '<:encoding(utf8)', 'testdata');
>
> (Untested.)

Tested, it works. But then again, it can be done this way only if we are 100% sure that the input is always UTF-8. That is not the case for my script, so I'm back to testing the input, and decoding it is still the easier option.

I guess that, apart from pod2man missing --utf8, there is no bug here and this report can be closed.

Still, even though, as pointed out, I could have found the answer by checking the general perl documentation about encoding, a line about it in the HTML::Entities man page could be useful. Nowadays, you can expect input to be UTF-8 very often.

-- 
http://yeupou.wordpress.com/
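Editorial sketch of the "test the input, then decode it" approach mentioned above, for input that is not guaranteed to be UTF-8. It uses only the core Encode module. The helper name to_characters is hypothetical, and the Latin-1 fallback is an assumption for illustration (the thread itself uses Encode::Detect::Detector for detection).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try a strict UTF-8 decode first and fall back to
# Latin-1 (an assumption made here, not something the thread prescribes).
sub to_characters {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; eval catches that.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $chars ? $chars : decode('ISO-8859-1', $bytes);
}

# "résumé" as UTF-8 bytes and as Latin-1 bytes: both decode to 6 characters.
print length(to_characters("r\xC3\xA9sum\xC3\xA9")), "\n";   # 6
print length(to_characters("r\xE9sum\xE9")), "\n";           # 6
```

Every byte sequence is valid Latin-1, so a fallback like this cannot tell Latin-1 apart from other single-byte encodings. The decoded string is what would then be handed to encode_entities().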
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
Package: libhtml-parser-perl
Version: 3.71-1+b3
Severity: important

Hello,

According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm

    use HTML::Entities;
    $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
    print encode_entities($input), "\n"

prints

    vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
    papier-m&acirc;ch&eacute; r&eacute;sum&eacute;

That's correct. However, here:

    $ cat test.pl
    #!/usr/bin/perl
    use HTML::Entities;
    $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
    print encode_entities($input), "\n"
    # EOF
    $ perl test.pl
    vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
    papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

Where do these &Atilde; come from? According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it stands for Ã.

I tested the same script on Debian stable and on some Ubuntu systems, with exactly the same result. I don't know what I'm doing wrong here, but a simple copy/paste of the documented example does not work.

Other similar commands work as expected. For instance:

    $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
    vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;

Plus, as a side bug (does it require a report of its own?), man HTML::Entities prints:

    For example, this:

        $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
        print encode_entities($input), "\n"

    Prints this out:
    [...]

Yes, the man page example is actually stripped of the accented characters it is supposed to encode!
-- System Information:
Debian Release: stretch/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 3.16.0-4-amd64 (SMP w/6 CPU cores)
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages libhtml-parser-perl depends on:
ii  libc6                       2.19-18
ii  libhtml-tagset-perl         3.20-2
ii  liburi-perl                 1.64-1
ii  perl                        5.20.2-6
ii  perl-base [perlapi-5.20.1]  5.20.2-6

libhtml-parser-perl recommends no packages.

Versions of packages libhtml-parser-perl suggests:
pn  libdata-dump-perl  <none>

-- no debconf information

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
>
> Hello,
>
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
>
>     use HTML::Entities;
>     $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>     print encode_entities($input), "\n"
>
> prints
>
>     vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
>     papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
> That's correct. However, here:
>
>     $ cat test.pl
>     #!/usr/bin/perl
>     use HTML::Entities;
>     $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>     print encode_entities($input), "\n"
>     # EOF
>     $ perl test.pl
>     vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
>     papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

I can confirm that. However, adding use utf8; to the test script fixes the output. So it seems to me that your test file is encoded in UTF-8 and you need to tell perl so.

HTML::Entities encodes characters, and it depends on perl's interpretation of the source text. Without an explicit 'use utf8' the source is considered to be Latin-1, which I think leads to the garbage above. If you recode the test file in Latin-1, everything will work as expected, since Latin-1 is the default encoding.

> Where do these &Atilde; come from? According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it stands for Ã.
>
> I tested the same script on Debian stable and on some Ubuntu systems, with exactly the same result. I don't know what I'm doing wrong here, but a simple copy/paste of the documented example does not work.

I guess the documentation needs 'use utf8;' somewhere, or maybe something more generic, since the same text may be encoded in Latin-1.

> Other similar commands work as expected.
> For instance:
>
>     $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
>     vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
> Plus, as a side bug (does it require a report of its own?), man HTML::Entities prints:
>
>     For example, this:
>
>         $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>         print encode_entities($input), "\n"
>
>     Prints this out:
>     [...]
>
> Yes, the man page example is actually stripped of the accented characters it is supposed to encode!

Not sure where the problem is here. perldoc works fine:

    perldoc HTML::Entities

whereas

    pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm

generates stuff like:

    \&    $input = "vis\-a\*`\-vis Beyonce\*'\*(Aqs nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*'";

which I guess is *roff speak for accents. Adding --utf8 seems to get it right:

    pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
        | man -l -
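Editorial sketch of where &Atilde;&copy; comes from, as explained above: the two UTF-8 bytes of "é" are read as two Latin-1 characters unless they are decoded first. Only core Perl is required; the last block runs only if HTML::Entities is installed. Variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# "é" as stored in a UTF-8 encoded file: the two bytes 0xC3 0xA9.
my $bytes = "\xC3\xA9";

# Left undecoded, perl sees two Latin-1 characters: U+00C3 (Ã) and U+00A9 (©).
my $undecoded = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $bytes;

# Decoded as UTF-8, the same data is the single character U+00E9 (é).
my $chars   = decode('UTF-8', $bytes);
my $decoded = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;

print "undecoded: $undecoded\n";   # undecoded: U+00C3 U+00A9
print "decoded:   $decoded\n";     # decoded:   U+00E9

# Optional: skipped silently when HTML::Entities is not available.
if (eval { require HTML::Entities; 1 }) {
    print HTML::Entities::encode_entities($bytes), "\n";   # &Atilde;&copy;
    print HTML::Entities::encode_entities($chars), "\n";   # &eacute;
}
```

In a script, use utf8; performs the equivalent decoding for string literals in the source file.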
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:

> Ok, so after further testing, it turns out that if I change the coding of the string from UTF-8 to ISO-8859..., it encodes to the proper entities.

Good.

> I obviously can adjust the script to pre-convert UTF-8 to ISO-8859

Or just add

    use utf8;

to your script if it contains utf8-encoded strings.

> but it should at least be documented (but I don't see any reason why encode_entities should actually not be able to deal with UTF-8)

That's how encoding in perl works in general, and I'm sure it's documented somewhere :)
(I just can't find the correct perldoc right now ...)

Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  https://www.debian.org/
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   NP: Treibhaus: Garish
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
Ok, so after further testing, it turns out that if I change the coding of the string from UTF-8 to ISO-8859..., it encodes to the proper entities.

I obviously can adjust the script to pre-convert UTF-8 to ISO-8859, but it should at least be documented (but I don't see any reason why encode_entities should actually not be able to deal with UTF-8).

Regards

-- 
http://yeupou.wordpress.com/
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
On Fri, 05 Jun 2015 13:35:24 +0200, Mathieu Roy wrote:

> However, here:
>
>     $ cat test.pl
>     #!/usr/bin/perl
>     use HTML::Entities;
>     $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>     print encode_entities($input), "\n"
>     # EOF
>     $ perl test.pl
>     vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
>     papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

Oh, fun with encodings in general and UTF-8 in particular again.

This works:

    % cat test.pl
    #!/usr/bin/perl
    use utf8;
    use HTML::Entities;
    $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
    print encode_entities($input), "\n"
    % perl test.pl
    vis-&agrave;-vis Beyonc&eacute;&#39;s na&iuml;ve
    papier-m&acirc;ch&eacute; r&eacute;sum&eacute;

> Where do these &Atilde; come from?

From perl not knowing that the script is utf8-encoded and taking it as Latin1 or something.

So, I'm not sure there is actually a bug somewhere. With use utf8; this works, and perl needs to be told about the encoding ...

> Plus, as a side bug (does it require a report of its own?), man HTML::Entities prints:
>
>     For example, this:
>
>         $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>         print encode_entities($input), "\n"
>
>     Prints this out:
>     [...]
>
> Yes, the man page example is actually stripped of the accented characters it is supposed to encode!

Ouch, ugly. Yes, please report a separate bug.

Cheers,
gregor
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
-=| Mathieu ROY, 05.06.2015 14:34:42 +0200 |=-
> Ok, so after further testing, it turns out that if I change the coding of the string from UTF-8 to ISO-8859..., it encodes to the proper entities.

This is because, in the absence of an explicit encoding statement, the perl interpreter considers the source text to be encoded in Latin-1.

From 'perldoc encoding':

    Implicit upgrading for byte strings
        By default, if strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will be created by decoding the byte strings as ISO 8859-1 (Latin-1).

        The encoding pragma changes this to use the specified encoding instead.

(Although note that the encoding pragma is deprecated. Better to use the utf8 pragma and encode your source as UTF-8.)

> I obviously can adjust the script to pre-convert UTF-8 to ISO-8859, but it should at least be documented (but I don't see any reason why encode_entities should actually not be able to deal with UTF-8)

encode_entities deals with whatever the perl interpreter supplies. And the perl interpreter needs your help in determining the meaning of the byte sequence you feed it.
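Editorial sketch of the "implicit upgrading" rule quoted from perldoc encoding, using two short strings and only core Perl. The variable names are illustrative and not taken from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $byte_string = "\xC3\xA9";    # UTF-8 bytes for "é", never decoded
my $char_string = "\x{263A}";    # a character string (U+263A)

# Concatenation upgrades the byte string as if it were Latin-1, so its two
# bytes become two separate characters in the result.
my $mixed = $byte_string . $char_string;
print length($mixed), "\n";      # 3

# Decoding first gives the intended result: "é" followed by U+263A.
my $clean = decode('UTF-8', $byte_string) . $char_string;
print length($clean), "\n";      # 2
```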
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
On Friday 5 June 2015 14:31:17, you wrote:
> On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:
>
> > Ok, so after further testing, it turns out that if I change the coding of the string from UTF-8 to ISO-8859..., it encodes to the proper entities.
>
> Good.
>
> > I obviously can adjust the script to pre-convert UTF-8 to ISO-8859
>
> Or just add
>     use utf8;
> to your script if it contains utf8-encoded strings.

That works for the test script all right. But in the script I'm actually working on, the string is imported from an image's EXIF data. In this case, use utf8 has no effect at all. The string is UTF-8 and encode_entities fails to convert it properly.

Instead of keeping strings as UTF-8 and expecting HTML::Entities to cope with them (it does not), I actually need to do the opposite: convert UTF-8 to perl's internal format and then call encode_entities.

Consider the following:

    $ cat test.pl
    #!/usr/bin/perl
    use utf8;
    use HTML::Entities;
    open(INPUT, "testdata");
    while (<INPUT>) {
        print encode_entities($_), "\n"
    }
    close(INPUT);
    $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" > testdata
    $ perl test.pl
    vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve\npapier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

Back to square one.

Now, without use utf8; but decoding:

    #!/usr/bin/perl
    use HTML::Entities;
    use Encode qw(decode);
    use Encode::Detect::Detector;
    open(INPUT, "testdata");
    while (<INPUT>) {
        print encode_entities(decode(detect($_),$_)), "\n"
    }
    close(INPUT);
    $ perl test.pl
    vis-&agrave;-vis Beyonc&eacute;&#39;s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;

> > but it should at least be documented (but I don't see any reason why encode_entities should actually not be able to deal with UTF-8)
>
> That's how encoding in perl works in general, and I'm sure it's documented somewhere :)
> (I just can't find the correct perldoc right now ...)

I expected these use utf8/no utf8 pragmas to be sort of transitional, and thought they should be avoided whenever not absolutely necessary. The description of use utf8; mentions:

    When UTF-8 becomes the standard source format, this pragma will effectively become a no-op.

Well, that day, if that day comes, HTML::Entities will definitely have to deal properly with UTF-8 first hand. :-)

Anyway, in the meantime, I prefer forcing strings to be decoded into the internal format over declaring that all strings are UTF-8.

Regards,

-- 
http://yeupou.wordpress.com/
Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
On Fri, 05 Jun 2015 16:20:24 +0200, Mathieu ROY wrote:

> > > I obviously can adjust the script to pre-convert UTF-8 to ISO-8859
> > Or just add
> >     use utf8;
> > to your script if it contains utf8-encoded strings.
> That works for the test script all right. But in the script I'm actually working on, the string is imported from an image's EXIF data. In this case, use utf8 has no effect at all.

Right, use utf8; only affects the _script_ but not input and output.

> The string is UTF-8 and encode_entities fails to convert it properly.

In this case I'd probably try with
    use utf8::all;
or tell open() about the encoding:

    $ cat test.pl
    #!/usr/bin/perl
    use utf8;
    use HTML::Entities;
    # was: open(INPUT, "testdata");
    open(my $fh, '<:encoding(utf8)', 'testdata');

(Untested.)

>     When UTF-8 becomes the standard source format, this pragma will effectively become a no-op.
>
> Well, that day, if that day comes, HTML::Entities will definitely have to deal properly with UTF-8 first hand. :-)

In my understanding, HTML::Entities doesn't have a problem with UTF-8; it's just about telling perl itself how the data in the script, or read from an external file, is encoded.

Cheers,
gregor
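Editorial sketch of the open() suggestion above: the same file is read once as raw bytes and once through a decoding layer. It uses only core Perl and a temporary file instead of the thread's testdata. The strict ':encoding(UTF-8)' spelling is a choice made here; the thread uses ':encoding(utf8)'.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the raw UTF-8 bytes of "résumé" to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "r\xC3\xA9sum\xC3\xA9\n";
close $out or die "close: $!";

# Plain open: the line comes back as undecoded bytes.
open(my $raw, '<', $path) or die "open: $!";
chomp(my $bytes = <$raw>);
close $raw;

# Open with an encoding layer: the line comes back as decoded characters.
open(my $fh, '<:encoding(UTF-8)', $path) or die "open: $!";
chomp(my $chars = <$fh>);
close $fh;

print length($bytes), "\n";   # 8 (bytes)
print length($chars), "\n";   # 6 (characters)
```

Only the second string holds the six characters that encode_entities() is expected to see.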