Although I've used it mostly for mirroring sites, I believe HTTrack has a --spider option.
On Sat, Sep 25, 2010 at 7:46 AM, Robin Wood <[email protected]> wrote:
> On 25 September 2010 02:46, Adrian Crenshaw <[email protected]> wrote:
> > Hi all,
> > I'm looking at some of the tools in BT4R1, and will be looking at what
> > Samurai WTF has to offer once I finish downloading the latest version. I'm
> > looking for some sort of spider that lets me do the following:
> >
> > 1. Follow every link on a page, even onto other domains, as long as the
> > top-level domain name is the same (edu, com, cn, whatever)
> > 2. For every page it visits, collect the file names of all resources.
> > 3. The headers, so I can see the server version.
> > 4. Grab the robots.txt if possible.
> >
> > Any ideas on the best tool for the job, or do I need to roll my own?
>
> If you want to roll your own you can take my CeWL code and check the
> spider. I do a full spider, check whether you are on the same site or
> off, and grab all the documents; you should easily be able to modify
> this to do what you want.
>
> Robin
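For anyone who does want to roll their own along the lines Robin describes, here's a rough stdlib-only Python sketch (hypothetical code, not taken from CeWL) covering the four requirements: stay within the same top-level domain, record resource file names per page, note the Server header, and fetch robots.txt. It's a starting point, not production code; names like top_level_domain and crawl are my own.

```python
# Minimal same-TLD spider sketch (hypothetical, not CeWL's implementation).
import urllib.request
import urllib.parse
from html.parser import HTMLParser

def top_level_domain(url):
    """Return the last hostname label, e.g. 'edu' or 'com'."""
    host = urllib.parse.urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()

class LinkParser(HTMLParser):
    """Collect href/src values from a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "href" in attrs:
            self.links.append(attrs["href"])
        if "src" in attrs:   # img, script, iframe, etc.
            self.links.append(attrs["src"])

def robots_txt(url):
    """Fetch /robots.txt for the url's host, or None on failure."""
    parts = urllib.parse.urlparse(url)
    robots = "{}://{}/robots.txt".format(parts.scheme, parts.netloc)
    try:
        with urllib.request.urlopen(robots, timeout=10) as resp:
            return resp.read().decode("utf-8", "replace")
    except OSError:
        return None

def crawl(start, max_pages=50):
    """Breadth-first crawl staying within the start URL's TLD.

    Returns {url: {"server": ..., "resources": [file names]}}.
    """
    tld = top_level_domain(start)
    seen, queue, report = set(), [start], {}
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or top_level_domain(url) != tld:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                server = resp.headers.get("Server")
                html = resp.read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(parser.unescape(html) if False else html)
        resources = []
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            # Record the file-name part of the path, if any.
            name = urllib.parse.urlparse(absolute).path.rsplit("/", 1)[-1]
            if name:
                resources.append(name)
            queue.append(absolute)
        report[url] = {"server": server, "resources": resources}
    return report
```

Note this matches the same-TLD rule exactly as stated in the original question (only the last label is compared), which is very permissive; restricting by registered domain instead would need a public-suffix list.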
_______________________________________________
Pauldotcom mailing list
[email protected]
http://mail.pauldotcom.com/cgi-bin/mailman/listinfo/pauldotcom
Main Web Site: http://pauldotcom.com
