On Thursday, 4 August 2016 11:35:58 CEST Dale R. Worley wrote:
> Tim Ruehsen <[email protected]> writes:
> > Sounds like "download everything from www.iana.org/assignments/ plus all
> > page requisites on www.iana.org". Page requisites from other domains
> > shouldn't be pulled in !?
> >
> > Then your first try was very close, it was basically:
> > wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
> >
> > With -d you can see that this page is being redirected to /protocols and
> > thus no further downloading takes place since /protocols would escape the
> > /assignments/ directory (not allowed due to --no-parent).
>
> I'm getting something different than that...
>
> First off, let's drop --page-requisites. That seems to be working
> exactly as I want it, and it just complicates the discussion.
>
> I'm also using wget 1.16.1, which is a couple of years old.
>
> If I run the command quoted above, I get output which shows the
> redirection happening, and the file is fetched successfully:
>
> [Quote characters ASCIIized.]
>
> $ wget -r --no-parent http://www.iana.org/assignments/index.html
> --2016-08-04 11:22:48-- http://www.iana.org/assignments/index.html
> Resolving www.iana.org (www.iana.org)... 192.0.32.8, 2620:0:2d0:200::8
> Connecting to www.iana.org (www.iana.org)|192.0.32.8|:80... connected.
> HTTP request sent, awaiting response... 302 Found
> Location: /protocols [following]
> --2016-08-04 11:22:48-- http://www.iana.org/protocols
> Reusing existing connection to www.iana.org:80.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: 'www.iana.org/assignments/index.html'
>
> www.iana.org/assign [ <=> ] 727.79K 578KB/s in 1.3s
>
> 2016-08-04 11:22:52 (578 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
>
> FINISHED --2016-08-04 11:22:52--
> Total wall clock time: 4.7s
> Downloaded: 1 files, 728K in 1.3s (578 KB/s)
> $ ls -lR .
> .:
> total 4
> drwxr-xr-x. 3 worley worley 4096 Aug 4 11:22 www.iana.org
>
> ./www.iana.org:
> total 4
> drwxr-xr-x. 2 worley worley 4096 Aug 4 11:22 assignments
>
> ./www.iana.org/assignments:
> total 728
> -rw-r--r--. 1 worley worley 745252 Aug 4 11:22 index.html
> $
>
> I can argue from the wording of the man page that this is correct, as
> --no-parent is described as "Do not ever ascend to the parent directory
> when retrieving recursively."
>
> What *seems* to be happening is that index.html is fetched, but its
> links are not fetched recursively, despite the -r and qualifying under
> --no-parent. E.g., line 23441 of that file is
>
> <td><a
> href="/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1"
> >YANG Module Names</a></td>
>
> which specifies a target URL of
> http://www.iana.org//assignments/yang-parameters/yang-parameters.xhtml.
> And yet, that file is not fetched.
>
>
> OK, using -d shows what the internal logic is: After fetching
> index.html, the wget output is:
>
> 2016-08-04 11:31:13 (576 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
>
> Deciding whether to enqueue "http://www.iana.org/protocols".
> Going to "" would escape "assignments" with no_parent on.
> Decided NOT to load it.
> Redirection "http://www.iana.org/protocols" failed the test.
>
This is basically what I said; sorry for not being clear.

> I'm going to have to think about that, as the behavior is rather
> counter-intuitive. It seems to me that if wget is willing to *fetch* a
> page, it should look at the links on the page for potential recursion.

This is what I meant by:
[It is debatable if this behavior regarding redirections should be changed or
not, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]

If you or someone else comes up with a patch, that would be very nice. Or just
open a bug report, so it won't be forgotten.

Tim
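
For illustration only, the decision that the -d output reports ('Going to "" would escape "assignments" with no_parent on') can be sketched roughly as below. This is not wget's actual code. The names directory_of and escapes_parent are made up for this example, and the sketch assumes both URLs are on the same host and that only the directory part of the path is compared.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return the directory part of a URL path without surrounding slashes.

    Hypothetical helper: "/assignments/index.html" -> "assignments",
    "/protocols" -> "" (the server root).
    """
    path = urlsplit(url).path
    return posixpath.dirname(path).strip("/")


def escapes_parent(start_url, candidate_url):
    """Return True if candidate_url lies outside the directory of start_url.

    Assumption: same host for both URLs; the fragment and query are ignored.
    """
    start_dir = directory_of(start_url)
    candidate_dir = directory_of(candidate_url)
    if start_dir == "":
        # Starting at the server root: nothing on this host can escape it.
        return False
    # Inside means the same directory or any directory below it.
    inside = (candidate_dir == start_dir
              or candidate_dir.startswith(start_dir + "/"))
    return not inside


start = "http://www.iana.org/assignments/index.html"
redirect_target = "http://www.iana.org/protocols"
link_in_page = ("http://www.iana.org/assignments/yang-parameters/"
                "yang-parameters.xhtml#yang-parameters-1")

print(escapes_parent(start, redirect_target))  # redirect target is at the root
print(escapes_parent(start, link_in_page))     # link stays below assignments/
```

In this sketch the redirect target is rejected because its directory is the server root, while the link found in the fetched page would pass the test. The counter-intuitive part discussed above is that the test is applied to the redirected URL, so the links in the page are never examined.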
