Package: libwww-perl
Version: 6.02-1
Severity: normal
Tags: upstream

This bug report is more or less what I gave on
https://rt.cpan.org/Public/Bug/Display.html?id=69393
with some additional information concerning Debian.

When a file declared as iso-8859-1 and served as text/html is also
a valid UTF-8 file, LWP::Simple::get from libwww-perl 6.02 regards
it as a UTF-8 encoded file. This is incorrect.

For instance, with lwp-dump being

#!/usr/bin/env perl
use strict;
use Devel::Peek;
use LWP::Simple;

@ARGV == 1 or die "Usage: $0 <URL>\n";
my $url = shift;
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
Dump $file;

and when running

for i in 1a 1h 2a 2h
do
  ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
    2> perl-lwp-test$i.dump
done

I get (see perl-lwp-test1h.dump in particular):

==> perl-lwp-test1a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
  CUR = 71
  LEN = 80

==> perl-lwp-test1h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 80

==> perl-lwp-test2a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

==> perl-lwp-test2h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

Note: my examples are not HTML files, but this doesn't matter. I first
thought the problem occurred for all text/* files (e.g. text/xml, that's
why I just wrote basic XML files), but in fact only text/html seems to
be affected.

How the bug should be fixed depends on the expected behavior. However,
LWP::Simple::get is not sufficiently documented. This means that the
other cases are potentially wrong too. Indeed, in lenny, I always get
a sequence of bytes (no UTF8 flag):

==> perl-lwp-test1a.dump <==
SV = PVIV(0x1b1ef38) at 0x1bec568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1c04130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test1h.dump <==
SV = PVIV(0x166af38) at 0x1738568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1750130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2a.dump <==
SV = PVIV(0x2150f38) at 0x221e568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x2236130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2h.dump <==
SV = PVIV(0x1752f38) at 0x1820568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1838130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

and in squeeze, ditto except perl-lwp-test1h.dump, which is already
wrong:

==> perl-lwp-test1a.dump <==
SV = PV(0x23ce758) at 0x1e455f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x23ce5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test1h.dump <==
SV = PV(0x2afe758) at 0x25755f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x2d5f9f0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 72

==> perl-lwp-test2a.dump <==
SV = PV(0x2a5d758) at 0x24d45f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2a5d5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2h.dump <==
SV = PV(0x28cd758) at 0x23445f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2b8e0c0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

A sequence of bytes is probably what one expects for files without
an HTTP charset (e.g. served as application/xml).

Also, what happens if a file is sent as text/html with UTF-8 charset,
but isn't a valid UTF-8 file?

The problem with the 1h file may come from HTTP::Message: if
LWP::Simple::get uses decoded_content from HTTP::Message, the content
is decoded with a default charset guessed by content_charset().
Charset guessing should strictly follow the explicit rules from

  http://www.w3.org/TR/REC-html40/charset.html#spec-char-encoding

to avoid inconsistencies like here.
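
To illustrate the ambiguity outside of LWP, here is a minimal sketch
using only the core Encode module. The guess_charset helper is
hypothetical: it merely mimics the behavior observed above (strict
UTF-8 if the bytes happen to be valid, ISO-8859-1 otherwise) and is
not the code used by HTTP::Message. The byte strings are modelled on
the two test files, with the XML declaration omitted.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Bodies modelled on the test files (assumption: declaration left out).
my $body1 = "<root>post\xc3\xa9... A</root>\n";     # also valid UTF-8
my $body2 = "<root>post\xc3\xa9... \xc3</root>\n";  # lone \xc3: not valid UTF-8

# Hypothetical helper, NOT the library's code: guess the charset by
# attempting a strict UTF-8 decode and falling back to ISO-8859-1.
sub guess_charset {
    my ($bytes) = @_;
    my $copy = $bytes;  # decode() may modify its argument when CHECK is set
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 'UTF-8' : 'ISO-8859-1';
}

for my $body ($body1, $body2) {
    my $guess    = guess_charset($body);
    my $guessed  = decode($guess, $body);         # what the guess produces
    my $declared = decode('ISO-8859-1', $body);   # what the file declares
    printf "guess=%-10s chars(guessed)=%d chars(declared)=%d\n",
        $guess, length $guessed, length $declared;
}
```

With the first body, the guessed decoding turns \303\251 into the
single character \x{e9}, as in the 1h dump, whereas the declared
encoding yields \x{c3}\x{a9}, as in the 1a dump. With the second body,
the lone \303 makes the strict UTF-8 decode fail, so both decodings
agree.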

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.39-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages libwww-perl depends on:
ii  ca-certificates             20110502   Common CA certificates
ii  libencode-locale-perl       1.02-1     utility to determine the locale en
ii  libfile-listing-perl        6.01-1     module to parse directory listings
ii  libhtml-parser-perl         3.68-1+b1  collection of modules that parse H
ii  libhtml-tagset-perl         3.20-2     Data tables pertaining to HTML
ii  libhtml-tree-perl           4.2-1      Perl module to represent and creat
ii  libhttp-cookies-perl        6.00-2     HTTP cookie jars
ii  libhttp-date-perl           6.00-1     module of date conversion routines
ii  libhttp-message-perl        6.01-1     perl interface to HTTP style messa
ii  libhttp-negotiate-perl      6.00-2     implementation of content negotiat
ii  liblwp-mediatypes-perl      6.01-1     module to guess media type for a f
ii  liblwp-protocol-https-perl  6.02-1     https driver for LWP::UserAgent
ii  libnet-http-perl            6.01-1     module providing low-level HTTP co
ii  liburi-perl                 1.58-1     module to manipulate and access UR
ii  libwww-robotrules-perl      6.01-1     database of robots.txt-derived per
ii  netbase                     4.46       Basic TCP/IP networking system
ii  perl                        5.12.4-1   Larry Wall's Practical Extraction

Versions of packages libwww-perl recommends:
ii  libauthen-ntlm-perl         1.08-1     authentication module for NTLM
ii  libhtml-form-perl           6.00-1     module that represents an HTML for
pn  libhtml-format-perl         <none>     (no description available)
ii  libhttp-daemon-perl         6.00-1     simple http server class
ii  libmailtools-perl           2.08-1     Manipulate email in perl programs

libwww-perl suggests no packages.

-- no debconf information

