Bug#633511: libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get and insufficient documentation

2017-03-31 Thread Vincent Lefevre
Control: forwarded -1 https://github.com/libwww-perl/libwww-perl/issues/226

On 2011-07-11 03:30:45 +0200, Vincent Lefevre wrote:
> This bug report is more or less what I gave on
> 
>   https://rt.cpan.org/Public/Bug/Display.html?id=69393

The ticket has been migrated to GitHub.

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)



Bug#633511: libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get and insufficient documentation

2011-07-11 Thread Vincent Lefevre
I forgot to describe the files used in my tests.

Concerning the file contents:
  * perl-lwp-test1a.xml and perl-lwp-test1h.xml have the same
    contents, which are also valid in the UTF-8 encoding.
  * perl-lwp-test2a.xml and perl-lwp-test2h.xml have the same
    contents, which are not valid in the UTF-8 encoding.

Concerning the HTTP headers:
  * perl-lwp-test1a.xml and perl-lwp-test2a.xml are served as
    application/xml, with no associated HTTP charset. This case
    is covered by RFC 3023.
  * perl-lwp-test1h.xml and perl-lwp-test2h.xml are served as
    text/html, with no associated HTTP charset.
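
The distinction between the two pairs of files can be checked locally.
The following is a minimal sketch using core Perl only. It assumes the
bodies contain the bytes visible in the dumps of the original report
(post\303\251... followed by A, or by a lone \303 byte); the helper
name is_valid_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# utf8::decode works in place, so it is applied to a copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

# Body bytes assumed from the dumps in the original report.
my $test1 = "<root>post\xc3\xa9... A</root>\n";     # test1a / test1h
my $test2 = "<root>post\xc3\xa9... \xc3</root>\n";  # test2a / test2h

printf "test1: %s\n", is_valid_utf8($test1) ? "valid UTF-8" : "not valid UTF-8";
printf "test2: %s\n", is_valid_utf8($test2) ? "valid UTF-8" : "not valid UTF-8";
```

The lone \303 byte in the second body is a UTF-8 lead byte with no
continuation byte after it, which is why that body cannot be valid UTF-8.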

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)






Bug#633511: libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get and insufficient documentation

2011-07-10 Thread Vincent Lefevre
Package: libwww-perl
Version: 6.02-1
Severity: normal
Tags: upstream

This bug report is more or less what I gave on

  https://rt.cpan.org/Public/Bug/Display.html?id=69393

with some additional information concerning Debian.

When a file declared as iso-8859-1 and served as text/html is also
a valid UTF-8 file, LWP::Simple::get from libwww-perl 6.02 regards
it as a UTF-8 encoded file. This is incorrect.

For instance, with lwp-dump being

#!/usr/bin/env perl

use strict;
use Devel::Peek;
use LWP::Simple;

@ARGV == 1 or die "Usage: $0 URL\n";
my $url = shift;
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
Dump $file;  # Devel::Peek writes the dump to stderr

and when running

  for i in 1a 1h 2a 2h
  do
    ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
      2> perl-lwp-test$i.dump
  done

I get (see perl-lwp-test1h.dump in particular):

== perl-lwp-test1a.dump ==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8
"<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}...
A</root>\n"]
  CUR = 71
  LEN = 80

== perl-lwp-test1h.dump ==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x13097d0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml
version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 80

== perl-lwp-test2a.dump ==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0
[UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}...
\x{c3}</root>\n"]
  CUR = 72
  LEN = 80

== perl-lwp-test2h.dump ==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1309850 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0
[UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}...
\x{c3}</root>\n"]
  CUR = 72
  LEN = 80

Note: my examples are not HTML files, but this doesn't matter. I first
thought the problem occurred for all text/* files (e.g. text/xml, that's
why I just wrote basic XML files), but in fact only text/html seems to
be affected.
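
The difference between the 1a and 1h dumps above is consistent with the
same two bytes (\303\251) having been decoded with two different
charsets: ISO-8859-1 gives the two characters \x{c3}\x{a9}, while UTF-8
gives the single character \x{e9}. A minimal sketch using the core
Encode module illustrates this; it shows the decoding step only and
makes no claim about LWP's internals. The helper name codepoints is
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a character string.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# The two bytes \303\251 from the served file.
my $bytes = "\xc3\xa9";

# Decoded as ISO-8859-1: each byte becomes one character.
my $latin1 = decode('ISO-8859-1', $bytes);

# Decoded as UTF-8: the two bytes form the single character U+00E9.
my $utf8 = decode('UTF-8', $bytes);

printf "ISO-8859-1: %d character(s): %s\n", length($latin1), codepoints($latin1);
printf "UTF-8:      %d character(s): %s\n", length($utf8),   codepoints($utf8);
```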

How the bug should be fixed depends on the expected behavior. However,
LWP::Simple::get is not sufficiently documented. This means that the
other cases are potentially wrong too. Indeed, in lenny, I always get
a sequence of bytes (no UTF8 flag):

== perl-lwp-test1a.dump ==
SV = PVIV(0x1b1ef38) at 0x1bec568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1c04130 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

== perl-lwp-test1h.dump ==
SV = PVIV(0x166af38) at 0x1738568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1750130 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

== perl-lwp-test2a.dump ==
SV = PVIV(0x2150f38) at 0x221e568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x2236130 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

== perl-lwp-test2h.dump ==
SV = PVIV(0x1752f38) at 0x1820568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1838130 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

and in squeeze, ditto except perl-lwp-test1h.dump, which is already
wrong:

== perl-lwp-test1a.dump ==
SV = PV(0x23ce758) at 0x1e455f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x23ce5b0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

== perl-lwp-test1h.dump ==
SV = PV(0x2afe758) at 0x25755f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x2d5f9f0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml
version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 72

== perl-lwp-test2a.dump ==
SV = PV(0x2a5d758) at 0x24d45f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2a5d5b0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

== perl-lwp-test2h.dump ==
SV = PV(0x28cd758) at 0x23445f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2b8e0c0 "<?xml version=\"1.0\"
encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

A sequence of bytes is probably what one expects for files without
an HTTP charset (e.g. served as application/xml).
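
If raw bytes are what is wanted, one possible workaround is to bypass
LWP::Simple and ask the response object for undecoded content. The
sketch below builds an HTTP::Response by hand instead of fetching the
test URL, so the header and body are assumptions modelled on the 1h
case. content and decoded_content(charset => 'none') are HTTP::Message
methods.

```perl
use strict;
use warnings;
use HTTP::Response;

# Assumed stand-in for perl-lwp-test1h.xml: text/html, no charset parameter.
my $body = qq{<?xml version="1.0" encoding="iso-8859-1"?>\n}
         . "<root>post\xc3\xa9... A</root>\n";
my $res = HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/html'], $body);

# Raw message body, exactly as stored in the response.
my $raw = $res->content;

# Any Content-Encoding is undone, but no charset decoding is applied.
my $undecoded = $res->decoded_content(charset => 'none');

printf "raw:       %d bytes, UTF8 flag %s\n", length($raw),
    utf8::is_utf8($raw) ? 'on' : 'off';
printf "undecoded: %d bytes, UTF8 flag %s\n", length($undecoded),
    utf8::is_utf8($undecoded) ? 'on' : 'off';
```

With a real request, the same two methods can be called on the object
returned by LWP::UserAgent's get method.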

Also, what happens if a file is sent as text/html with UTF-8 charset,
but isn't a valid UTF-8 file?
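
What LWP does in that case is not established here. As a point of
comparison, the core Encode module offers both a lenient and a strict
behavior when decoding. A sketch, assuming the invalid byte sequence
from the test2 files:

```perl
use strict;
use warnings;
use Encode ();

# Bytes that are not valid UTF-8: a lone \303 lead byte before '<'.
my $bytes = "\xc3</root>";

# Lenient (default): the malformed byte is replaced by U+FFFD.
my $lenient = Encode::decode('UTF-8', $bytes);

# Strict: decoding dies on malformed input; LEAVE_SRC keeps $bytes intact.
my $strict = eval {
    Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "lenient: first character is U+%04X\n", ord($lenient);
print  "strict:  ", (defined $strict ? "decoded\n" : "failed: $error");
```

Which of the two behaviors LWP::Simple::get should expose is exactly
the kind of point the documentation would need to state.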

The problem with the 1h file may come from HTTP::Message: if
LWP::Simple::get uses decoded_content from HTTP::Message, the
content is decoded with a default charset guessed by
content_charset(). Charset guessing should strictly
follow the explicit rules from