It seems to me that wget should not reuse a connection from one host to access another (even if those hosts share an IP address). I suspect the current behavior is accidental rather than intentional.
Tony

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Ryan Rawdon
Sent: Tuesday, April 17, 2012 9:16 AM
To: [email protected]
Subject: [Bug-wget] persistence with multiple hostnames

I was speaking with Micah on IRC today regarding a behavior in wget which differs from curl and most or all browsers. Generally, HTTP clients do not use a given persistent connection for more than one hostname. That is why tricks such as spreading static content across multiple name-based vhosts on the same IP address work: they encourage more parallelization in the fetching of a page's static elements. However, wget appears to use persistent connections for multiple hostnames (see below).

In the case below, a connection is opened to soldat.pl, which 302s to a new hostname. Wget resolves the new hostname, selects the same address, and decides to reuse the existing connection to this IP address. The RFC does not appear to address the reuse of persistent connections with regard to hostname, so the behavior is permissible (and fine from a protocol standpoint, since Host is specified with each request).

The problem stems from the use of privilege separation between virtual hosts. In the case below, before I fixed it today, wget was receiving a 403 on the second request because the user that owned this fd on the server side did not have privileges to access the content for the soldat.thd.vg vhost. This is probably reproducible with any page fetched with wget that 302s between two privilege-separated vhosts on the same server, or when scraping a page with elements from two or more hosts on the same IP address.

This behavior appears to be permissible based on the RFC, so this is more a discussion of whether this is intended behavior in wget, a bug, or an opportunity to behave more like curl and everyday GUI browsers.
Micah took a quick look over the source (or was previously familiar with it), and it sounds like there may be checks in place which should have prevented this; however, I did look to confirm.

nova-dhcp-host111:tmp ryan$ wget http://soldat.pl
--2012-04-17 11:57:25-- http://soldat.pl/
Resolving soldat.pl (soldat.pl)... 2607:fd50:1:91b0::50:1d8, 192.168.152.5
Connecting to soldat.pl (soldat.pl)|2607:fd50:1:91b0::50:1d8|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://soldat.thd.vg/ [following]
--2012-04-17 11:57:25-- http://soldat.thd.vg/
Resolving soldat.thd.vg (soldat.thd.vg)... 2607:fd50:1:91b0::50:1d8, 192.168.152.5
Reusing existing connection to soldat.pl:80.
HTTP request sent, awaiting response... 302 Found
Location: http://soldat.thd.vg/en/ [following]
--2012-04-17 11:57:26-- http://soldat.thd.vg/en/
Reusing existing connection to soldat.pl:80.
HTTP request sent, awaiting response... 200 OK
Cookie coming from soldat.thd.vg attempted to set domain to soldat.thd.vg
Cookie coming from soldat.thd.vg attempted to set domain to soldat.thd.vg
Cookie coming from soldat.thd.vg attempted to set domain to soldat.thd.vg
Length: unspecified [text/html]

Here is the original report from a user which shows the 403:

snide@vooserver-vps:~$ wget www.soldat.pl
--2012-04-17 11:50:29-- http://www.soldat.pl/
Resolving www.soldat.pl... 67.23.118.186, 2607:fd50:1:91b0::50:1d8
Connecting to www.soldat.pl|67.23.118.186|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://soldat.thd.vg/ [following]
--2012-04-17 11:50:29-- http://soldat.thd.vg/
Resolving soldat.thd.vg... 67.23.118.186, 2607:fd50:1:91b0::50:1d8
Reusing existing connection to www.soldat.pl:80.
HTTP request sent, awaiting response... 403 Forbidden
2012-04-17 11:50:29 ERROR 403: Forbidden.

snide@vooserver-vps:~$ wget -6 www.soldat.pl
--2012-04-17 11:50:39-- http://www.soldat.pl/
Resolving www.soldat.pl... 2607:fd50:1:91b0::50:1d8
Connecting to www.soldat.pl|2607:fd50:1:91b0::50:1d8|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://soldat.thd.vg/ [following]
--2012-04-17 11:50:39-- http://soldat.thd.vg/
Resolving soldat.thd.vg... 2607:fd50:1:91b0::50:1d8
Reusing existing connection to www.soldat.pl:80.
HTTP request sent, awaiting response... 403 Forbidden
2012-04-17 11:50:39 ERROR 403: Forbidden.
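
To make the two reuse policies under discussion concrete, here is a minimal Python sketch of the decision. It is not taken from wget's source; the names PersistentConnection, can_reuse, and match_by_address are hypothetical and exist only for illustration. With match_by_address enabled, the check reproduces what the logs above show (a different hostname that resolves to the already-connected address is treated as reusable). With it disabled, a connection is reused only for the hostname it was opened for, which is how curl and browsers are described as behaving.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PersistentConnection:
    """Hypothetical record of an open keep-alive connection."""
    host: str  # hostname the connection was originally opened for
    ip: str    # address the socket is actually connected to
    port: int


def can_reuse(conn: PersistentConnection, host: str, port: int,
              resolved_ips: Iterable[str], match_by_address: bool) -> bool:
    """Decide whether an open connection may serve a request for host:port.

    match_by_address=True  -> also accept a different hostname if it
                              resolves to the connected address (the
                              behavior observed in the report).
    match_by_address=False -> require the same hostname (the stricter,
                              per-hostname policy).
    """
    if conn.port != port:
        return False
    # Hostnames are case-insensitive, so compare them lowercased.
    if conn.host.lower() == host.lower():
        return True
    # Different hostname: only reusable under the address-based policy.
    return match_by_address and conn.ip in set(resolved_ips)


# Values taken from the first log in the report.
conn = PersistentConnection("soldat.pl", "2607:fd50:1:91b0::50:1d8", 80)
addrs = ["2607:fd50:1:91b0::50:1d8", "192.168.152.5"]

lenient = can_reuse(conn, "soldat.thd.vg", 80, addrs, match_by_address=True)
strict = can_reuse(conn, "soldat.thd.vg", 80, addrs, match_by_address=False)
print(lenient, strict)  # True False
```

Under the strict policy the redirect to soldat.thd.vg would open a fresh connection, so a server that ties privileges to the connection's first vhost would see the second request on its own socket.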
