On Wednesday, August 3, 2016 11:55:55 AM CEST Dale R. Worley wrote:
> Tim Rühsen <[email protected]> writes:
> > If you have a look at 'man wget'/--page-requisites, the stuff is explained
> > quite well. To me it looks like you are missing --level 2.
> >
> > If --level 2 is not what you want. you could make your point clear by
> > making up a small document tree as an example.
>
> I definitely don't want --level 2, because that limits how many links
> the recursion can traverse. If all the links are within the
> /assignments/ directory, wget should follow an unlimited number.
>
> Here's an outline of what I want retrieved, based on Matthew White's
> listing:
>
> www.iana.org/
> Some or all of these files are OK, since they're likely page requisites:
> www.iana.org/_css/
> www.iana.org/_css/2015.1/
> www.iana.org/_css/2015.1/print.css
> www.iana.org/_css/2015.1/screen.css
> www.iana.org/_img/
> www.iana.org/_img/2011.1/
> www.iana.org/_img/2011.1/icons/
> ...
> www.iana.org/_js/
> www.iana.org/_js/2013.1/
> www.iana.org/_js/2013.1/iana.js
> www.iana.org/_js/2013.1/jquery.js
> Nothing in these directories:
> www.iana.org/about/
> www.iana.org/abuse/
> Lots and lots of files in this directory:
> www.iana.org/assignments/
> www.iana.org/assignments/_6lowpan-parameters/
> www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml.html
> www.iana.org/assignments/_support/
> www.iana.org/assignments/_support/iana-registry.css
> www.iana.org/assignments/_support/jquery.js
> www.iana.org/assignments/_support/sort.js
> www.iana.org/assignments/aaa-parameters/
> www.iana.org/assignments/aaa-parameters/aaa-parameters-1.csv
> www.iana.org/assignments/aaa-parameters/aaa-parameters.txt
> www.iana.org/assignments/aaa-parameters/aaa-parameters.xhtml.html
> www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
> www.iana.org/assignments/abfab-parameters/
> www.iana.org/assignments/abfab-parameters/abfab-parameters.txt
> www.iana.org/assignments/abfab-parameters/abfab-parameters.xhtml.html
> www.iana.org/assignments/abfab-parameters/abfab-parameters.xml
> www.iana.org/assignments/abfab-parameters/urn-parameters.csv
> ...
> Nothing in these directories:
> www.iana.org/dnssec/
> www.iana.org/domains/
> www.iana.org/go/
> www.iana.org/help/
> www.iana.org/numbers/
> www.iana.org/procedures/
> www.iana.org/protocols/
> www.iana.org/reports/
Sounds like "download everything from www.iana.org/assignments/ plus all page
requisites on www.iana.org". Page requisites from other domains shouldn't be
pulled in!?

Then your first try was very close, it was basically:

  wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html

With -d you can see that this page is being redirected to /protocols and thus
no further downloading takes place, since /protocols would escape the
/assignments/ directory (not allowed due to --no-parent).

[It is debatable if this behavior regarding redirections should be changed or
not, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]

You are currently left with what Matthew White already suggested.

A similar approach would be to extract all links from 'protocols', build a list
of all referenced links and filter it with e.g. (e)grep:

  wget -d --convert-links -r --no-parent --page-requisites http://www.iana.org/assignments/index.html 2>&1|grep ^TO_COMPLETE|cut -d' ' -f 4 >list.txt

After editing and filtering list.txt, download all the URLs including their
page requisites:

  wget --convert-links --page-requisites -x -i list.txt

Tim
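
The "filter list.txt" step above could look roughly like the sketch below. It
assumes list.txt ends up with one absolute URL per line, and that the page
requisites live under /_css/, /_img/ and /_js/ as in the quoted outline. The
function name filter_list and the output file filtered.txt are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): read URLs on stdin and keep only
# those under /assignments/ or under the page-requisite directories
# (_css, _img, _js) on www.iana.org. Duplicates are dropped by sort -u.
filter_list() {
    grep -E '^https?://www\.iana\.org/(assignments|_css|_img|_js)/' | sort -u
}

# Possible usage, assuming list.txt was produced as described above:
#   filter_list < list.txt > filtered.txt
#   wget --convert-links --page-requisites -x -i filtered.txt
```

The directory names in the pattern are only a guess from the outline; adjust
them if the requisites turn out to live elsewhere.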
