Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity

2015-06-06 Thread Mathieu ROY
Hi Gregor,

On Friday 5 June 2015, 17:21:18, gregor herrmann wrote:

> In this case I'd probably try with "use utf8::all;" or tell open()
> about the encoding:
>
> >   $ cat test.pl
> > #!/usr/bin/perl
> > use utf8;
> > use HTML::Entities;
> >
> > open(INPUT, "< testdata");
>
> open(my $fh, '<:encoding(utf8)', 'testdata');
>
> (Untested.)

Tested, it works.

But then again, it can only be done this way if we are 100% positive that
the input is always UTF-8 (which is not the case for my script, so I'm back
to testing the input, and it's still easier to decode it).
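
One way to "test the input" without an extra detection module is to attempt a strict UTF-8 decode and fall back to Latin-1 when that fails. The following is only a sketch under that assumption, not code from this thread: decode_guess is a hypothetical helper name, and Latin-1 as the fallback is a guess that has to suit the data at hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: try strict UTF-8 first, otherwise assume Latin-1.
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK argument may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# The same word as raw bytes in two different encodings.
my $from_utf8   = encode_entities(decode_guess("na\xC3\xAFve"));  # UTF-8 bytes
my $from_latin1 = encode_entities(decode_guess("na\xEFve"));      # Latin-1 bytes

print "$from_utf8\n";
print "$from_latin1\n";
```

Both calls should give the same entity for the "ï". Note that some Latin-1 byte sequences are also valid UTF-8, so this kind of guess can be wrong for short strings.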

I guess then that, apart from the missing --utf8 in the pod2man call, there
is no bug here and this report can be closed.

Still, even though, as pointed out, I could have found the answer by checking
the general Perl documentation about encoding, maybe just a line about it in
the HTML::Entities man page would be useful.
Nowadays, you can expect input to be UTF-8 very often.


-- 
http://yeupou.wordpress.com/



2015-06-05 Thread Mathieu Roy
Package: libhtml-parser-perl
Version: 3.71-1+b3
Severity: important

Hello,

According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm


 use HTML::Entities;
 $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
 print encode_entities($input), "\n"

prints

 vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
 papier-m&acirc;ch&eacute; r&eacute;sum&eacute;


That's correct.


However, here:

  $ cat test.pl
#!/usr/bin/perl

use HTML::Entities;
$input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
print encode_entities($input), "\n"

# EOF

  $ perl test.pl
vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;


Where do these &Atilde; come from?
According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's for
Ã.

I tested the same script on a debian stable and on some ubuntu with the exact 
same result.

I don't know what I'm doing wrong here, but a simple copy/paste of the
documented example does not work.

Other similar commands work as expected. For instance:

echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;




Plus, as a side bug (does it require a report of its own?),
man HTML::Entities prints

   For example, this:

     $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
     print encode_entities($input), "\n"

   Prints this out:

[...]

Yes, the man page example is actually stripped of the very characters it is
supposed to encode!






-- System Information:
Debian Release: stretch/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 3.16.0-4-amd64 (SMP w/6 CPU cores)
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages libhtml-parser-perl depends on:
ii  libc6                       2.19-18
ii  libhtml-tagset-perl         3.20-2
ii  liburi-perl                 1.64-1
ii  perl                        5.20.2-6
ii  perl-base [perlapi-5.20.1]  5.20.2-6

libhtml-parser-perl recommends no packages.

Versions of packages libhtml-parser-perl suggests:
pn  libdata-dump-perl  <none>

-- no debconf information


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org




2015-06-05 Thread Damyan Ivanov
-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
>
> Hello,
>
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
>
>  use HTML::Entities;
>  $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>  print encode_entities($input), "\n"
>
> prints
>
>  vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
>  papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
> That's correct.
>
> However, here:
>
>   $ cat test.pl
> #!/usr/bin/perl
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> # EOF
>
>   $ perl test.pl
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

I can confirm that. However, adding use utf8; to the test script 
fixes the output. So it seems to me that your test file is encoded in 
utf8 and you need to tell that to perl.

HTML::Entities encodes characters, and it depends on perl's 
interpretation of the source text. Without an explicit 'use utf8' it 
is considered to be Latin1, which I think leads to the garbage above.

If you recode the test file in latin1, everything will work as 
expected, since latin1 is the default encoding.
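
To illustrate the explanation above, here is a minimal sketch (not taken from the thread) that spells out the UTF-8 bytes with \x escapes, so the result does not depend on how the script file itself is saved. Encode::decode stands in here for what "use utf8" does for string literals in the source; the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9.
my $bytes = "r\xC3\xA9sum\xC3\xA9";

# Not decoded: each byte is taken as one Latin-1 character,
# 0xC3 being "Ã" (&Atilde;) and 0xA9 being "©" (&copy;).
my $undecoded = encode_entities($bytes);

# Decoded: each byte pair becomes the single character U+00E9.
my $decoded = encode_entities(decode('UTF-8', $bytes));

print "$undecoded\n";    # r&Atilde;&copy;sum&Atilde;&copy;
print "$decoded\n";      # r&eacute;sum&eacute;
```

The first line of output reproduces the pattern from the report; the second is what the documentation example promises.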

> Where do these &Atilde; come from?
> According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's
> for Ã.
>
> I tested the same script on a debian stable and on some ubuntu with the exact
> same result.
>
> I don't know what I'm doing wrong here but a simple copy/paste of the
> documented example does not work.

I guess the documentation needs 'use utf8;' somewhere or maybe 
something more generic, since the same text may be encoded in latin1.

> Other similar commands work as expected. For instance:
>
> echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
> Plus, as a side bug (require a report on its own?),
> man HTML::Entities prints
>
>    For example, this:
>
>      $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>      print encode_entities($input), "\n"
>
>    Prints this out:
>
> [...]
>
> Yes, the man page example is actually stripped of entities to encode!

Not sure where the problem is here. perldoc works fine:

 perldoc HTML::Entities

pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm 
generates stuff like:

 \ $input = vis\-a\*`\-vis Beyonce\*'\*(Aqs 
 nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*';

Which I guess is *roff speak for accents.

Adding --utf8 seems to get it right:

 pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
 |   man -l -






2015-06-05 Thread gregor herrmann
On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:

> Ok, so after further testing, it turns out that if I change the coding of the
> string from UTF-8 to ISO-8859..., it encodes to the proper entities.

Good.

> I obviously can adjust the script to pre-convert UTF-8 to ISO-8859

Or just add "use utf8;" to your script if it contains utf8-encoded
strings.

> but it
> should be at least documented (but I don't see any reason why encode_entities
> should actually not be able to deal with UTF-8)

That's how encoding in perl works in general, and I'm sure it's
documented somewhere :)
(I just can't find the correct perldoc right now ...)


Cheers,
gregor
-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer -  https://www.debian.org/
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   NP: Treibhaus: Garish





2015-06-05 Thread Mathieu ROY
Ok, so after further testing, it turns out that if I change the coding of the
string from UTF-8 to ISO-8859..., it encodes to the proper entities.

I obviously can adjust the script to pre-convert UTF-8 to ISO-8859, but it
should at least be documented (and I don't see any reason why encode_entities
should not be able to deal with UTF-8).

Regards


-- 
http://yeupou.wordpress.com/






2015-06-05 Thread gregor herrmann
On Fri, 05 Jun 2015 13:35:24 +0200, Mathieu Roy wrote:

> However, here:
>
>   $ cat test.pl
> #!/usr/bin/perl
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> # EOF
>
>   $ perl test.pl
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

Oh, fun with encodings in general and UTF-8 in particular again.

This works:

% cat test.pl
#!/usr/bin/perl

use utf8;

use HTML::Entities;
$input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
print encode_entities($input), "\n"


% perl test.pl
vis-&agrave;-vis Beyonc&eacute;&#39;s na&iuml;ve
papier-m&acirc;ch&eacute; r&eacute;sum&eacute;

> Where do these &Atilde; come from?

From perl not knowing that the script is utf8-encoded and taking it
as Latin1 or something.


So, I'm not sure there is actually a bug somewhere.
With use utf8; this works, and perl needs to be told about the
encoding ...


> Plus, as a side bug (require a report on its own?),
> man HTML::Entities prints
>
>    For example, this:
>
>      $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>      print encode_entities($input), "\n"
>
>    Prints this out:
>
> [...]
>
> Yes, the man page example is actually stripped of entities to encode!

Ouch, ugly.
Yes, please report a separate bug. 


Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer -  https://www.debian.org/
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   NP: Penelope Swales: Lost & Found





2015-06-05 Thread Damyan Ivanov
-=| Mathieu ROY, 05.06.2015 14:34:42 +0200 |=-
> Ok, so after further testing, it turns out that if I change the coding of the
> string from UTF-8 to ISO-8859..., it encodes to the proper entities.

This is because, in the absence of an explicit encoding statement, the perl
interpreter considers the source text to be encoded in Latin1.

From 'perldoc encoding', section "Implicit upgrading for byte strings":

   By default, if strings operating under byte semantics and
   strings with Unicode character data are concatenated, the new
   string will be created by decoding the byte strings as ISO
   8859-1 (Latin-1).
   The encoding pragma changes this to use the specified
   encoding instead.

(Although note that the encoding pragma is deprecated. Better use the 
utf8 pragma and encode your source as UTF-8).

> I obviously can adjust the script to pre-convert UTF-8 to ISO-8859
> but it should be at least documented (but I don't see any reason why
> encode_entities should actually not be able to deal with UTF-8)

encode_entities deals with whatever the perl interpreter supplies. And 
the perl interpreter needs your help in determining the meaning of the 
byte sequence you feed it with.
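
A small sketch of what "the meaning of the byte sequence" amounts to in practice, using only the core Encode module. It is an illustration added here, not code from the thread, and the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                 # raw UTF-8 encoding of U+00E9
my $chars = decode('UTF-8', $bytes);    # the same data, decoded to characters

# Before decoding perl sees two separate characters (0xC3 and 0xA9);
# after decoding it sees a single one.
printf "%d %d\n", length($bytes), length($chars);    # 2 1
printf "U+%04X\n", ord($chars);                      # U+00E9
```

Any function that works character by character, encode_entities included, only gets the second view if the data has been decoded first.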






2015-06-05 Thread Mathieu ROY
On Friday 5 June 2015 14:31:17, you wrote:
> On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:
> > Ok, so after further testing, it turns out that if I change the coding of
> > the string from UTF-8 to ISO-8859..., it encodes to the proper entities.
>
> Good.
>
> > I obviously can adjust the script to pre-convert UTF-8 to ISO-8859
>
> Or just add "use utf8;" to your script if it contains utf8-encoded
> strings.

That works for the test script all right.

But in the script I'm actually working on, the string is imported from an
image's EXIF data. In this case, "use utf8" has no effect at all: the string
is UTF-8 and encode_entities fails to convert it properly.

Instead of keeping strings in UTF-8 and expecting HTML::Entities to cope
properly with them (it does not), I actually need to do the opposite: convert
UTF-8 to perl's internal format and then call encode_entities.



Consider the following:

  $ cat test.pl
#!/usr/bin/perl
use utf8;
use HTML::Entities;

open(INPUT, "< testdata");
while (<INPUT>) {
print encode_entities($_), "\n"
}
close(INPUT);

  $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" > testdata

  $ perl test.pl
vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve\npapier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;


Back to square one.

Now, without use utf8; but decoding:

#!/usr/bin/perl

use HTML::Entities;
use Encode qw(decode);
use Encode::Detect::Detector;

open(INPUT, "< testdata");
while (<INPUT>) {
# guess the encoding of the raw line, then decode it into characters
print encode_entities(decode(detect($_),$_)), "\n"
}
close(INPUT);

  $ perl test.pl
vis-&agrave;-vis Beyonc&eacute;&#39;s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;


> > but it
> > should be at least documented (but I don't see any reason why
> > encode_entities
> > should actually not be able to deal with UTF-8)
>
> That's how encoding in perl works in general, and I'm sure it's
> documented somewhere :)
> (I just don't find the correct perldoc right now ...)

I expected these "use utf8" / "no utf8" pragmas to be somewhat transitional,
and thought they should be avoided whenever not absolutely necessary.

The description of "use utf8;" mentions:

"When UTF-8 becomes the standard source format, this pragma will effectively
become a no-op."

Well, that day, if that day comes, HTML::Entities will definitely have to deal
properly with UTF-8 first hand. :-)

Anyway, in the meantime, I tend to prefer forcing strings to be decoded into
the internal format rather than declaring that all strings are UTF-8.

Regards,


-- 
http://yeupou.wordpress.com/






2015-06-05 Thread gregor herrmann
On Fri, 05 Jun 2015 16:20:24 +0200, Mathieu ROY wrote:

> > > I obviously can adjust the script to pre-convert UTF-8 to ISO-8859
> > Or just add "use utf8;" to your script if it contains utf8-encoded
> > strings.
> That works for the test script all right.
> But in the script I'm actually working on, the string is imported from an
> image's EXIF data. And in this case, use utf8 has no effect at all.

Right, use utf8; only affects the _script_ but not input and
output.

> The string is
> utf8 and encode_entities fails to convert it properly.

In this case I'd probably try with "use utf8::all;" or tell open()
about the encoding:

>   $ cat test.pl
> #!/usr/bin/perl
> use utf8;
> use HTML::Entities;
>
> open(INPUT, "< testdata");

open(my $fh, '<:encoding(utf8)', 'testdata');

(Untested.)

> When UTF-8 becomes the standard source format, this pragma will effectively
> become a no-op.
>
> Well, that day, if that day comes, HTML::Entities will definitely have to
> deal properly with UTF-8 first hand. :-)

In my understanding, HTML::Entities doesn't have a problem with
UTF-8; it's just about telling perl itself how the data in the
script, or read from an external file, is encoded.
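
For data read from a file, a self-contained sketch of the open() approach could look like the following. It assumes the file really is UTF-8; the temporary file and the variable names exist only for the sake of the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Entities qw(encode_entities);

# Write raw UTF-8 bytes to a temporary file standing in for "testdata".
my ($out, $path) = tempfile();
binmode $out;
print {$out} "papier-m\xC3\xA2ch\xC3\xA9\n";
close $out;

# The :encoding layer makes readline return decoded characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

my $html = encode_entities($line);
print "$html\n";    # papier-m&acirc;ch&eacute;

unlink $path;
```

With a plain open() and no encoding layer, the same line would come back as bytes and produce the &Atilde; sequences again.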
 

Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer -  https://www.debian.org/
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   NP: Peter, Paul and Mary: For Loving Me

