Hello everyone,
I am currently working on a small tool to generate sitemaps. I have been
using wget with the --spider option, which does the job *almost*
perfectly: it performs a HEAD request first, then a GET request only when
the content-type is HTML or similar.
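For what it's worth, this is roughly how I double-check that
content-type by hand (curl is just what I happen to use, purely for
illustration; the URL is my test host described below):

    # HEAD request only, then look at the reported content-type
    curl -s -I http://test.com/ | grep -i '^Content-Type'
    # e.g. Content-Type: text/html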
I currently have a test site (say test.com) set up like this:
/index.html (default page)
/robots.txt
/images
/images/image1.jpg
/images/hidden-image.jpg
So this command line:
wget -r --spider http://test.com
produces the following result:
"HEAD / HTTP/1.0" 200 - "-" "Wget/1.12 (linux-gnu)"
"GET / HTTP/1.0" 200 402 "-" "Wget/1.12 (linux-gnu)"
"GET /robots.txt HTTP/1.0" 200 38 "-" "Wget/1.12 (linux-gnu)"
"HEAD /images/image1.jpg HTTP/1.0" 200 - "http://test.com/" "Wget/1.12 (linux-gnu)"
Wget has parsed the default page (index.html) and found the file
image1.jpg. However, I would also like wget to recursively read the
directory http://test.com/images, whose listing is browseable, so that
it also discovers hidden-image.jpg.
But with this command line:
wget -r --spider http://test.com/images
it does list all the files contained in that images folder.
So my question is this: is there a way to force wget to try browsing
*every* directory found during the crawl, starting from the root URL
(http://test.com)?
The aim, of course, is to discover as many files as possible, including
files that are not linked from any page.
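In case it helps to show what I am after, here is a rough, untested
wrapper sketch I have been playing with (it assumes GNU grep/sed, a
server with directory listings enabled, and crawl.log is just a name I
picked):

    # Crawl once, harvest every directory URL from the log, then
    # spider each directory listing directly with --no-parent.
    wget -r --spider -o crawl.log http://test.com
    # Extract the logged URLs, strip the filename part, de-duplicate:
    grep -o 'http://test\.com/[^ "]*' crawl.log | sed 's|[^/]*$||' | sort -u |
    while read -r dir; do
        wget -r --spider --no-parent "$dir"
    done

It should cover the directories that appear in the crawl log, but a
built-in wget option would obviously be cleaner.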
Thanks a lot for any insight.
Marj