[Bug-wget] Recursive download and `trivial' redirects

Maxim Kuznetsov Sun, 24 Nov 2013 23:32:54 -0800

Hello there,

When a directory (or some `clean' URL) is requested without a trailing
slash -- e.g. example.com/foo -- web servers often add the slash via a
redirect: example.com/foo -> example.com/foo/.  I'll hereafter call
such redirects `trivial'.


The problem is that some websites (e.g. ocw.mit.edu) use links without
a trailing slash.  When Wget (with -r) retrieves example.com/foo, it
saves the content to the file `foo' regardless of the redirect.  Then,
when Wget parses `foo' and sees a link to example.com/foo/file.bar, it
deletes the regular file `foo' and creates a directory with the same
name (in the function mkalldirs(), see url.c:1220).  As a result, we
lose the entire page.
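
To make the failure mode concrete, here is a minimal sketch of the
file-versus-directory collision.  It is written in Python for brevity
and only mimics the behaviour described above; it is not the code of
mkalldirs(), and the helper name make_dir_clobbering_file is made up
for illustration.

```python
import os
import tempfile


def make_dir_clobbering_file(path: str) -> bool:
    """Create directory `path`; if a regular file is in the way, delete it.

    Returns True if a file had to be removed.  Hypothetical helper that
    imitates the described behaviour, not Wget's actual implementation.
    """
    removed = False
    if os.path.isfile(path):
        os.remove(path)  # the previously saved page is lost here
        removed = True
    os.makedirs(path, exist_ok=True)
    return removed


with tempfile.TemporaryDirectory() as root:
    foo = os.path.join(root, "foo")
    # Step 1: example.com/foo is saved as the regular file `foo'.
    with open(foo, "w") as f:
        f.write("<a href='foo/file.bar'>link</a>")
    # Step 2: saving example.com/foo/file.bar needs `foo' to be a directory.
    page_lost = make_dir_clobbering_file(foo)
    now_a_dir = os.path.isdir(foo)
    print(page_lost, now_a_dir)  # prints: True True
```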

Example reproducer (GNU Wget 1.14.97-1221):
$ wget -d -r --no-parent \
    http://ocw.mit.edu/courses/mathematics/18-100b-analysis-i-fall-2010/ \
    2>&1 | grep "directory danger"
Removing ocw.mit.edu/courses/<skipped>/assignments because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/readings-notes because of directory danger!
Removing ocw.mit.edu/courses/<skipped>/study-materials because of directory danger!

--trust-server-names solves this problem, but it is not obvious to a
user that it has to be passed every time together with -r, to say
nothing of the security concerns with that option.

Does it sound reasonable to handle such `trivial' redirects (those
that simply add a trailing slash) as a special case regardless of
`trust-server-names'?
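
For illustration only, such a check could look roughly like the sketch
below.  It is Python rather than Wget's C, the name
is_trivial_redirect is hypothetical, and it assumes the redirect
target has already been resolved to an absolute URL.

```python
from urllib.parse import urlsplit


def is_trivial_redirect(original: str, location: str) -> bool:
    """Return True if `location` is `original` plus one trailing slash.

    Hypothetical helper for illustration; not part of Wget.
    """
    old, new = urlsplit(original), urlsplit(location)
    # Scheme, host, and query string must be unchanged.
    same_origin = (old.scheme, old.netloc.lower(), old.query) == (
        new.scheme,
        new.netloc.lower(),
        new.query,
    )
    if not same_origin:
        return False
    # The only permitted difference is a single "/" appended to the path.
    return not old.path.endswith("/") and new.path == old.path + "/"


print(is_trivial_redirect("http://example.com/foo", "http://example.com/foo/"))
print(is_trivial_redirect("http://example.com/foo", "http://example.com/bar/"))
```

Only a redirect that passes such a test would be followed for naming
purposes; any other redirect would still be subject to
`trust-server-names'.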

Thanks

--
Maxim Kuznetsov
