Thanks for all those test cases.

Out of curiosity, I looked at the code a bit and ran bin/html2xhtml on
your test HTML file.

So what happens is:

lib/HTML/HTML5/Parser.pm: parse_file():
my $response = HTML::HTML5::Parser::UA->get($file, $opts->{user_agent});

lib/HTML/HTML5/Parser/UA.pm: get():
interestingly takes the _get_lwp route even for file:/// URIs and
returns its response

lib/HTML/HTML5/Parser.pm: parse_file():
then uses $response->{decoded_content}, which triggers a
"Wide character" warning when printed; presumably this is where
things start to go wrong

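For what it's worth, here is a minimal sketch of why a decoded string triggers that warning. It is not taken from the module: the sample strings and variable names are made up for illustration, and an in-memory filehandle stands in for STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would sit in a UTF-8 encoded file ("e acute" is C3 A9).
my $bytes = "caf\xC3\xA9";

# Decoding turns the two bytes into one character, so the length shrinks.
my $chars = decode('UTF-8', $bytes);

# A character above U+00FF cannot be written as a single byte.
my $wide = decode('UTF-8', "\xE2\x82\xAC");    # U+20AC EURO SIGN

my @warnings;
{
    # Collect warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    # The handle has no :encoding layer, just like a default STDOUT.
    open my $fh, '>', \my $buffer or die "cannot open in-memory handle: $!";
    print {$fh} $wide;
    close $fh;
}

printf "bytes: %d, characters: %d, warnings: %d\n",
    length($bytes), length($chars), scalar @warnings;
```

The warning goes away once the output handle has an encoding layer (for example via binmode with ':encoding(UTF-8)'), or when undecoded bytes are printed instead.
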
What helps is:
- replace in lib/HTML/HTML5/Parser.pm
  $response->{decoded_content} with $response->{content}
  which feels a bit dangerous
- or in get() in lib/HTML/HTML5/Parser/UA.pm:
  move the
  if ($uri =~ /^file:/i)
  check up so it is the first alternative; _get_fs is then used
  for file: URIs

The latter change would be, as a diff:

--- a/lib/HTML/HTML5/Parser/UA.pm
+++ b/lib/HTML/HTML5/Parser/UA.pm
@@ -18,14 +18,14 @@ sub get
        my ($class, $uri, $ua) = @_;

+       if ($uri =~ /^file:/i)
+               { goto \&_get_fs }
        if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
                { goto \&_get_tiny }
        if (ref $ua and $ua->isa('LWP::UserAgent'))
                { goto \&_get_lwp }
        if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
                { goto \&_get_lwp }
-       if ($uri =~ /^file:/i)
-               { goto \&_get_fs }

        goto \&_get_tiny;

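To show why the order of the checks matters, here is a simplified, hypothetical model of that dispatch. pick_backend and its arguments are my own invention, it ignores the $ua argument entirely, and it only returns the name of the branch that would be taken instead of fetching anything.

```perl
use strict;
use warnings;

# Simplified, hypothetical model of the dispatch in get(): it only names
# the branch that would be chosen and ignores the $ua argument entirely.
sub pick_backend {
    my ($uri, $lwp_loaded, $file_check_first) = @_;

    # Patched order: the file: test runs before anything else.
    return '_get_fs' if $file_check_first and $uri =~ /^file:/i;

    # Stands in for the "LWP::UserAgent is loaded and not $NO_LWP" test.
    return '_get_lwp' if $lwp_loaded;

    # Original position of the file: test, only reached without LWP.
    return '_get_fs' if $uri =~ /^file:/i;

    return '_get_tiny';
}

my $uri = 'file:///tmp/test.html';    # made-up example URI
print "original order: ", pick_backend($uri, 1, 0), "\n";
print "patched order:  ", pick_backend($uri, 1, 1), "\n";
```

With LWP loaded, the original order never reaches the file: test, which matches the behaviour described above.
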
While this helps for reading local files, I guess the _get_lwp() case
might still be buggy.


