In regard to my difficulties with recursively retrieving http://www.iana.org/assignments/index.html: I discovered that one page (http://www.iana.org/assignments/forces/forces.xhtml) is pointed to by no fewer than three different URLs:
http://www.iana.org/assignments/forces/forces.xhtml
http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml
http://www.iana.org/assignments/forces

The first is the proper URL for it, and the other two are redirected to the first. There are several other occurrences of this situation.

I also discovered that if I specify --trust-server-names, wget realizes that the redirection URL needs to be retrieved only once, and that links to the other two URLs can be directed to that one file. Without --trust-server-names, wget considers all three URLs to be different, even though they are redirected to the same URL, and dutifully stores essentially the same content three times. With --trust-server-names, wget understands that all three URLs are the same. It turns out that this gives me a much better mirror of the web site.

I've attached a patch that improves the documentation of --trust-server-names, to clarify that if -nd is not in effect, the file name is constructed from the entire redirection URL, not just its last component. (--trust-server-names is also mentioned in doc/metalink-standard.txt, but that text does not seem to me to have the problem the patch corrects.)

Dale
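P.S. To make the naming difference concrete, here is a rough Python sketch of the behavior described above. It is only a simplified model, not wget's actual code: the function name `local_name`, its parameters, and the `REDIRECTS` table are made up for illustration, and the table simply records the two redirects mentioned in this message. It assumes the usual recursive layout of host name plus URL path.

```python
import posixpath
from urllib.parse import urlsplit

# Redirects reported in this message (hypothetical lookup table,
# standing in for what the server would answer).
PROPER = "http://www.iana.org/assignments/forces/forces.xhtml"
REDIRECTS = {
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml": PROPER,
    "http://www.iana.org/assignments/forces": PROPER,
}


def local_name(url, trust_server_names=False, no_directories=False):
    """Simplified model of how a local file name is chosen.

    trust_server_names: base the name on the redirection URL.
    no_directories:     keep only the last path component (like -nd).
    """
    # Pick which URL the name is derived from.
    source = REDIRECTS.get(url, url) if trust_server_names else url
    parts = urlsplit(source)
    path = parts.path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"
    if no_directories:
        return posixpath.basename(path)
    # Without -nd, the whole URL (host + path) shapes the name.
    return posixpath.join(parts.netloc, path)


urls = [PROPER, *REDIRECTS]
without = {local_name(u) for u in urls}
with_tsn = {local_name(u, trust_server_names=True) for u in urls}
print(len(without), "distinct names without --trust-server-names")
print(len(with_tsn), "distinct name(s) with --trust-server-names")
```

In this model the three link URLs produce three different local names by default, and a single name built from the whole redirection URL when redirects are trusted. That is the point the patch tries to make clearer.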
From 740c68d4d820334362dc93ce5c31b9d742f12558 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <[email protected]>
Date: Wed, 2 Nov 2016 12:14:46 -0400
Subject: [PATCH] Improve documentation of --trust-server-names.

---
 doc/wget.texi | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index 91219e5..3632fd1 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1700,9 +1700,11 @@ with a http status code that indicates error.
 
 @cindex Trust server names
 @item --trust-server-names
-If this is set to on, on a redirect the last component of the
-redirection URL will be used as the local file name. By default it is
-used the last component in the original URL.
+If this is set, on a redirect, the local file name will be based
+on the redirection URL. By default the local file name is based on
+the original URL. When doing recursive retrieving this can be helpful
+because in many web sites redirected URLs correspond to an underlying
+file structure, while link URLs do not.
 
 @cindex authentication
 @item --auth-no-challenge
@@ -3261,8 +3263,8 @@ Turn on recognition of the (non-standard) @samp{Content-Disposition}
 HTTP header---if set to @samp{on}, the same as @samp{--content-disposition}.
 
 @item trust_server_names = on/off
-If set to on, use the last component of a redirection URL for the local
-file name.
+If set to on, construct the local file name from redirection URLs
+rather than original URLs.
 
 @item continue = on/off
 If set to on, force continuation of preexistent partially retrieved
-- 
1.8.3.1
