Matthew Nuzum wrote:
>
> Ht://Dig uses up nearly a gig of my bandwidth every month. It tends to
> lap itself if I run it daily, so I've started running it every other
> day. Otherwise it runs great.
>
> I am now presented with the task of needing to mirror sites, in addition
> to indexing them with my search engine. I shudder to think of the
> resources this will use, both on my web servers to be mirrored/indexed
> and my bandwidth. Disk space is not a big concern to me, though.
>
> Is it possible to create a mirror of a site using the information in
> ht://dig's databases so that I can save the extra effort of mirroring?
>
> I was using rsync to keep my bandwidth low, but now I need to switch to
> something that works like wget so that I can get static html snapshots
> instead of the actual cgi/php/asp source pages.

To keep bandwidth usage low, you can

(1) Use a caching proxy such as Squid when mirroring sites with wget
    or similar tools (maybe even with rsync, but I haven't tried
    that yet).
(2) Use the host name aliasing features of ht://Dig, which allow
    you to index the (mirrored) pages locally and still provide
    search services for any host that mirrors the respective sites.
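
As a rough illustration of (1), here is a minimal Python sketch that assembles a wget mirror run routed through a caching proxy. The function name, the example site, the destination directory, and the proxy address (Squid's usual default port 3128 on localhost) are assumptions for illustration only; adjust them to your setup.

```python
import os


def build_mirror_command(site_url, proxy_url, dest_dir):
    """Return (argv, env) for a wget mirror run that goes through a proxy.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = [
        "wget",
        "--mirror",                        # recursive download with timestamping
        "--convert-links",                 # rewrite links so the snapshot browses locally
        "--no-parent",                     # do not climb above the starting directory
        f"--directory-prefix={dest_dir}",  # where the snapshot is stored
        site_url,
    ]
    env = dict(os.environ)
    env["http_proxy"] = proxy_url          # wget picks up the proxy from this variable
    return argv, env


# Assumed values: example.com as the site, Squid listening on localhost:3128.
argv, env = build_mirror_command(
    "http://www.example.com/", "http://localhost:3128/", "/var/mirror"
)
print(" ".join(argv))
```

To actually start the mirror, the pair could be handed to `subprocess.run(argv, env=env, check=True)`. How much bandwidth the proxy saves depends on the cache headers the mirrored servers send.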

Caveats: You cannot use wget or similar tools with dynamic pages
that
    (a) use HTTP POST request methods
    (b) use HTTP GET request methods with arguments
    (c) use cookies (shops etc.)

However, if you have to rsync the pages to be mirrored, you can
still index locally.

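
The caveats above could be turned into a quick pre-check before mirroring. This is only a sketch that encodes the three conditions as stated in this message; the function name and the example URLs are made up, and real sites may behave differently.

```python
from urllib.parse import urlsplit


def can_snapshot(method, url, needs_cookies=False):
    """Return True if none of the three caveats applies to this page.

    Hypothetical helper that mirrors the caveats listed in the message.
    """
    if method.upper() != "GET":   # (a) pages reached via POST
        return False
    if urlsplit(url).query:       # (b) GET requests with arguments
        return False
    if needs_cookies:             # (c) pages that depend on cookies
        return False
    return True


print(can_snapshot("GET", "http://www.example.com/index.html"))
print(can_snapshot("GET", "http://www.example.com/list.php?page=2"))
```

Pages that fail the check would still need rsync or some other access to the source.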
hth,
Torsten
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14 Tel: +49-4101-403605
D-25474 Ellerbek Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED] Internet: http://www.inwise.de