Hello everyone,
I am currently working on a small tool to generate sitemaps. I have been
using wget with the --spider option, which does the job *almost*
perfectly: it performs a HEAD request first, then a GET request only when
the content-type is HTML or similar.
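For what it's worth, this is roughly how I double-check that
content-type by hand (curl is just what I happen to use, purely for
illustration; the URL is my test host described below):

    # HEAD request only, then look at the reported content-type
    curl -s -I http://test.com/ | grep -i '^Content-Type'
    # e.g. Content-Type: text/html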
I currently have a test site (say test.com) set up like this:
/index.html (default page)
/robots.txt
/images
/images/image1.jpg
/images/hidden-image.jpg
So this command line:
wget -r --spider http://test.com
produces the following result:
"HEAD / HTTP/1.0" 200 - "-" "Wget/1.12 (linux-gnu)"
"GET / HTTP/1.0" 200 402 "-" "Wget/1.12 (linux-gnu)"
"GET /robots.txt HTTP/1.0" 200 38 "-" "Wget/1.12 (linux-gnu)"
"HEAD /images/image1.jpg HTTP/1.0" 200 - "http://test.com/" "Wget/1.12 (linux-gnu)"
Wget has parsed the default page (index.html) and found the file
image1.jpg. However, I would also like wget to recursively read the
directory http://test.com/images, whose listing is browseable, so that
it also discovers hidden-image.jpg.
But with this command line:
wget -r --spider http://test.com/images
it does list all the files contained in that images folder.
So my question is this: is there a way to force wget to try browsing
*every* directory found during the crawl, starting from the root URL
(http://test.com)?
The aim, of course, is to discover as many files as possible, including
files that are not linked from any page.
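In case it helps to show what I am after, here is a rough, untested
wrapper sketch I have been playing with (it assumes GNU grep/sed, a
server with directory listings enabled, and crawl.log is just a name I
picked):

    # Crawl once, harvest every directory URL from the log, then
    # spider each directory listing directly with --no-parent.
    wget -r --spider -o crawl.log http://test.com
    # Extract the logged URLs, strip the filename part, de-duplicate:
    grep -o 'http://test\.com/[^ "]*' crawl.log | sed 's|[^/]*$||' | sort -u |
    while read -r dir; do
        wget -r --spider --no-parent "$dir"
    done

It should cover the directories that appear in the crawl log, but a
built-in wget option would obviously be cleaner.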
Thanks a lot for any insight.
Marj