Tim,

Almost bang on, though I hadn't thought of the case where the domain name itself changes. That brings up a related thought: if the server responds with a 301/302 redirect, the user probably does expect wget to download the redirected website. Say I execute:

$ wget -r www.example.com

and the server responds with a "302 Found" and a Location header pointing to: http://example.iana.org
In this case I had indeed intended to download the location pointed to by the Location header. Maybe if the first response is a redirect, wget should either print a verbose message or update the parent domain name accordingly.

However, my question was more specific: if the server redirects to a domain that matches www.<old domain name>, shouldn't wget simply accept it and refresh the parent domain name that it holds? We shouldn't ask the user to add extra -D/-H options (not everyone does RTFM) for such a common scenario.

On Thu, May 2, 2013 at 9:00 PM, Tim Ruehsen <[email protected]> wrote:
> Darshit, I guess you are talking about redirection.
>
> That is, 'wget -r gnu.org' is being redirected to www.gnu.org (via a
> Location header). Wget now follows the redirection, but only downloads
> index.html, since all included URLs in index.html refer to www.gnu.org,
> while we requested stuff from gnu.org.
>
> That's why only one file (index.html) is downloaded. But that is not
> what the user expects...
>
> The user could work around it using the -D and/or -H options, but then
> he has to know about the redirection before he starts wget. Not
> everyone has the understanding to find that out.
>
> Should wget's behaviour change (by default or via a new option), or
> should we leave it as is and print a verbose message that makes the
> situation clear to the user?
>
> Regards,
> Tim
>
> On Thursday, 02 May 2013, Micah Cowan wrote:
> > I believe you want -H -D gnu.org. That's what it's for. Wget doesn't
> > know which hostnames under a domain should be allowed and which
> > should not be (do you want images.gnu.org? git.gnu.org?
> > lists.gnu.org?), so it turns 'em all off unless you ask for them
> > explicitly.
> >
> > HTH,
> > -mjc
> >
> > On Thu, May 2, 2013 at 4:52 AM, Darshit Shah <[email protected]> wrote:
> > > I should have been clearer: --span-hosts will enqueue the other
> > > files, but it will also enqueue files from other hosts.
> > > I wish to recursively download a website, but not the other sites
> > > that it links to.
> > >
> > > Of course, I could add --accept-regex / --reject-regex options to
> > > prevent wget from wandering onto other hosts. But shouldn't the
> > > default --recursive option simply handle cases where a "www" is
> > > either added or removed? Or is there a scenario I am missing that
> > > would cause undesirable effects here?
> > >
> > > On Thu, May 2, 2013 at 5:22 PM, Giuseppe Scrivano <[email protected]> wrote:
> > >> Darshit Shah <[email protected]> writes:
> > >> > When using the --recursive option with wget, there seems to be a
> > >> > small issue with the logic that decides whether to enqueue a file
> > >> > into the download list.
> > >> >
> > >> > By default, wget downloads files only from the same host. However,
> > >> > this causes a problem when the target hostname changes, e.g.:
> > >> > parent: gnu.org
> > >> > target: www.gnu.org
> > >> >
> > >> > This issue causes wget to stop after just one download on a lot
> > >> > of sites. I'm not sure whether it also exists in the release
> > >> > version, since I only have the development version installed.
> > >>
> > >> does --span-hosts fix this scenario for you?
> > >>
> > >> Cheers,
> > >> Giuseppe
> > >
> > > --
> > > Thanking You,
> > > Darshit Shah
> > > Research Lead, Code Innovation
> > > Kill Code Phobia.
> > > B.E.(Hons.) Mechanical Engineering, '14. BITS-Pilani
> >
> > With kind regards
> > Tim Rühsen

--
Thanking You,
Darshit Shah
Research Lead, Code Innovation
Kill Code Phobia.
B.E.(Hons.) Mechanical Engineering, '14. BITS-Pilani
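P.S. For what it's worth, the "accept www.<old domain name>" check I'm proposing could be sketched roughly like this. This is just an illustration of the heuristic, not wget's actual code; the helper name `same_site` is mine:

```shell
#!/bin/sh
# Sketch of the proposed heuristic: treat a redirect target as the
# "same site" if it is identical to the original host, or differs
# only by a leading "www." in either direction.
same_site() {
    orig="$1"
    target="$2"
    [ "$target" = "$orig" ] && return 0
    [ "$target" = "www.$orig" ] && return 0
    [ "$orig" = "www.$target" ] && return 0
    return 1
}

same_site gnu.org www.gnu.org && echo "gnu.org -> www.gnu.org: accept, refresh parent domain"
same_site gnu.org lists.gnu.org || echo "gnu.org -> lists.gnu.org: reject, as today"
```

Under this rule the gnu.org -> www.gnu.org redirect would be followed and the parent domain refreshed, while redirects to genuinely different hosts (lists.gnu.org, images.gnu.org, or a different domain entirely) would still require explicit -H/-D options.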
