Although I've used it mostly for mirroring sites, I believe httrack has
a --spider option.
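
If you do end up rolling your own, below is a rough, untested Python 3
sketch of the four things you asked for, using only the standard
library. The SEED URL, the MAX_PAGES cap, and the naive one-label TLD
check are placeholders of mine, not anything taken from CeWL or
httrack, so treat it as a starting point rather than a finished tool:

#!/usr/bin/env python3
# Untested sketch of the spider described below. Python 3 stdlib only.
# SEED, MAX_PAGES and the one-label tld() check are hypothetical
# placeholders, not taken from CeWL or httrack.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

SEED = "http://example.edu/"   # hypothetical start page
MAX_PAGES = 200                # stop eventually

class LinkParser(HTMLParser):
    """Collect <a href> links plus img/script/link resource URLs."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])

def tld(url):
    # Naive: last dot-separated label of the hostname ("edu", "cn", ...).
    host = urllib.parse.urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]

seen = set()
seen_hosts = set()
queue = [SEED]
while queue and len(seen) < MAX_PAGES:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    try:
        resp = urllib.request.urlopen(url, timeout=10)
    except Exception as err:
        print("skip %s (%s)" % (url, err))
        continue
    # 3. headers, so you can see the server version
    print(url, "->", resp.headers.get("Server", "unknown"))
    # 4. grab robots.txt once per host
    host = urllib.parse.urlparse(url).netloc
    if host not in seen_hosts:
        seen_hosts.add(host)
        robots = urllib.parse.urljoin(url, "/robots.txt")
        try:
            print(robots, ":", urllib.request.urlopen(robots, timeout=10).read(200))
        except Exception:
            pass
    parser = LinkParser()
    parser.feed(resp.read().decode("utf-8", "replace"))
    # 2. file names of all resources on this page
    for res in parser.resources:
        print("  resource:", urllib.parse.urljoin(url, res))
    # 1. follow every link whose top-level domain matches the seed's
    for link in parser.links:
        absolute = urllib.parse.urljoin(url, link)
        if absolute.startswith("http") and tld(absolute) == tld(SEED):
            queue.append(absolute)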

On Sat, Sep 25, 2010 at 7:46 AM, Robin Wood <[email protected]> wrote:

> On 25 September 2010 02:46, Adrian Crenshaw <[email protected]> wrote:
> > Hi all,
> >     I'm looking at some of the tools in BT4R1, and will be looking at
> > what Samurai WTF has to offer once I finish downloading the latest
> > version. I'm looking for some sort of spider that lets me do the
> > following:
> >
> > 1. Follow every link on a page, even onto other domains, as long as
> > the top-level domain is the same (edu, com, cn, whatever).
> > 2. For every page it visits, collect the file names of all resources.
> > 3. Grab the headers so I can see the server version.
> > 4. Grab the robots.txt if possible.
> >
> > Any ideas on the best tool for the job, or do I need to roll my own?
> >
> If you want to roll your own, you can take my CeWL code and look at
> the spider: it does a full crawl, checks whether each link stays on
> the same site or goes off it, and grabs all the documents. You should
> easily be able to modify it to do what you want.
>
> Robin
_______________________________________________
Pauldotcom mailing list
[email protected]
http://mail.pauldotcom.com/cgi-bin/mailman/listinfo/pauldotcom
Main Web Site: http://pauldotcom.com
