Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-07 Thread gregor herrmann
On Mon, 07 Aug 2017 08:48:36 -0700, Gregory Williams wrote:

> On Aug 7, 2017, at 8:26 AM, gregor herrmann  wrote:
> > This looks indeed much better than my crude workarounds, thanks for
> > that!
> > Do you think you can take this up with upstream?
> Yes, I think Kjetil and I can work on getting this merged upstream.

Aweseome, thank you!


Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   


signature.asc
Description: Digital Signature


Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-07 Thread Gregory Williams
On Aug 7, 2017, at 8:26 AM, gregor herrmann  wrote:
> 
> 
> This looks indeed much better than my crude workarounds, thanks for
> that!
> 
> Do you think you can take this up with upstream?

Yes, I think Kjetil and I can work on getting this merged upstream.

Thanks,
Greg



Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-07 Thread gregor herrmann
Control: tag -1 + patch

On Sun, 06 Aug 2017 18:11:34 -0700, Gregory Williams wrote:

> I also looked into this and found another possible fix:
> 
> diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm 
> HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
> --- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm  2013-07-08 
> 07:12:25.0 -0700
> +++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm  2017-08-06 
> 12:42:58.0 -0700
> @@ -13,6 +13,7 @@
>  use HTML::HTML5::Parser::TagSoupParser;
>  use Scalar::Util qw(blessed);
>  use URI::file;
> +use Encode qw(encode_utf8);
>  use XML::LibXML;
>  
>  BEGIN {
> @@ -102,6 +103,11 @@
>   {
>  # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
>  my ($errh, $errors) = @{$self}{qw(error_handler errors)};
> +
> +if (utf8::is_utf8($text)) {
> + $text   = encode_utf8($text);
> +}
> +
>   $self->{parser}->parse_byte_string(
>  $opts->{'encoding'}, $text, $dom,
>  sub {
> 

This looks indeed much better than my crude workarounds, thanks for
that!

Do you think you can take this up with upstream?


Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   


signature.asc
Description: Digital Signature


Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-06 Thread Vincent Lefevre
On 2017-08-06 18:11:34 -0700, Gregory Williams wrote:
> The above patch should handle the LWP case which the previously
> suggest patch avoids. It still passes the test suite (which should
> probably be improved to verify this case), and also supports the
> test case detailed in this bug report (though I should mention that
> I believe the test script included by Vincent Lefevre includes a
> double-encoding bug as $doc->toString() actually returns utf8
> encoded bytes, which the :encoding(UTF-8) PerlIO layer on stdout
> will attempt to encode a second time).

Indeed. I would rather see this as a Perl design bug (at least
under UTF-8 locales).

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-06 Thread Gregory Williams
On Sat, 5 Aug 2017 12:16:04 -0400 gregor herrmann  wrote:
> What helps is:
> - replace in lib/HTML/HTML5/Parser.pm
>   $response->{decoded_content} with $response->{content}
>   which feels a bit dangerous
> - or in lib/HTML/HTML5/Parser/UA.pm's get:
>   move the
>   if ($uri =~ /^file:/i)
>   up so it's the first alternative and then _get_fs is used
> 
> 
> The latter change would be, as a diff:
> 
> #v+
> --- a/lib/HTML/HTML5/Parser/UA.pm
> +++ b/lib/HTML/HTML5/Parser/UA.pm
> @@ -18,14 +18,14 @@ sub get
>  {
> my ($class, $uri, $ua) = @_;
> 
> +   if ($uri =~ /^file:/i)
> +   { goto \&_get_fs }
> if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
> { goto \&_get_tiny }
> if (ref $ua and $ua->isa('LWP::UserAgent'))
> { goto \&_get_lwp }
> if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
> { goto \&_get_lwp }
> -   if ($uri =~ /^file:/i)
> -   { goto \&_get_fs }
> 
> 
> 
> While this helps for reading local files, I guess the _get_lwp() case
> might still be buggy.


I also looked into this and found another possible fix:

diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm 
HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
--- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm2013-07-08 
07:12:25.0 -0700
+++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm2017-08-06 
12:42:58.0 -0700
@@ -13,6 +13,7 @@
 use HTML::HTML5::Parser::TagSoupParser;
 use Scalar::Util qw(blessed);
 use URI::file;
+use Encode qw(encode_utf8);
 use XML::LibXML;
 
 BEGIN {
@@ -102,6 +103,11 @@
{
 # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
 my ($errh, $errors) = @{$self}{qw(error_handler errors)};
+
+if (utf8::is_utf8($text)) {
+   $text   = encode_utf8($text);
+}
+
$self->{parser}->parse_byte_string(
 $opts->{'encoding'}, $text, $dom,
 sub {


Part of the underlying issue here is that many variables and methods in these 
modules are named in a confusing way, expecting/requiring encoded bytes, but 
using names which imply a desire for decoded strings.

The above patch should handle the LWP case which the previously suggest patch 
avoids. It still passes the test suite (which should probably be improved to 
verify this case), and also supports the test case detailed in this bug report 
(though I should mention that I believe the test script included by Vincent 
Lefevre includes a double-encoding bug as $doc->toString() actually returns 
utf8 encoded bytes, which the :encoding(UTF-8) PerlIO layer on stdout will 
attempt to encode a second time).

thanks,
.greg



Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

2017-08-05 Thread gregor herrmann
On Wed, 22 Oct 2014 14:13:17 +0200, Vincent Lefevre wrote:

> Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks 
> parse_file
> 
> As a consequence of this bug, html2xhtml doesn't work at all when
> applied on a file. No problems when the HTML document is provided
> in the standard input, though. 
[..]
> parse_file is used in the former test (like in my original bug report),
> and parse_string is used in the latter test. Thus it seems that's
> parse_file that is broken. Hence the retitle.

Thanks for all those test cases.

Out of curiosity, I looked at the code a bit and used your test HTML
file with bin/html2xhtml:

So what happens is:

lib/HTML/HTML5/Parser.pm: parse_file():
my $response = HTML::HTML5::Parser::UA->get($file, $opts->{user_agent});

lib/HTML/HTML5/Parser/UA.pm: get():
interestingly takes the _get_lwp route for file:/// and returns stuff

lib/HTML/HTML5/Parser.pm: parse_file():
then takes $response->{decoded_content};
which generates, when printed, a wide character warning, and
presumably from here on things go south

What helps is:
- replace in lib/HTML/HTML5/Parser.pm
  $response->{decoded_content} with $response->{content}
  which feels a bit dangerous
- or in lib/HTML/HTML5/Parser/UA.pm's get:
  move the
  if ($uri =~ /^file:/i)
  up so it's the first alternative and then _get_fs is used


The latter change would be, as a diff:

#v+
--- a/lib/HTML/HTML5/Parser/UA.pm
+++ b/lib/HTML/HTML5/Parser/UA.pm
@@ -18,14 +18,14 @@ sub get
 {
my ($class, $uri, $ua) = @_;

+   if ($uri =~ /^file:/i)
+   { goto \&_get_fs }
if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
{ goto \&_get_tiny }
if (ref $ua and $ua->isa('LWP::UserAgent'))
{ goto \&_get_lwp }
if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
{ goto \&_get_lwp }
-   if ($uri =~ /^file:/i)
-   { goto \&_get_fs }

goto \&_get_tiny;
 }
#v-


While this helps for reading local files, I guess the _get_lwp() case
might still be buggy.


Cheers,
gregor


-- 
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   


signature.asc
Description: Digital Signature