The configuration file for htdig seems geared to the idea of a
few top-level URLs, with lots of pages underneath these. We're
working on an application that will have thousands of top-level
URLs. Most of these will have only a few pages underneath them.
The total number of pages is likely between 10,000 and 100,000,
which seems within the normal range. It's just that we have a
forest of small shrubs (thousands times dozens) instead of a few
huge sequoias (dozens times thousands).
(1) Obviously, we can automatically generate a config file that
lists thousands of start URLs. Will this create performance
problems, i.e., does it take significantly longer to dig through
thousands of small shrubs than through a few huge trees? Is this
kind of use intended?
(2) Has anyone used htdig in this fashion?
(3) Alternatively, what web crawlers integrate well with htmerge?
(4) Intuitively, this is a concern only for htdig, not htmerge or
htsearch. Is there any reason I should question this intuition?
Thanks!
Russell
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html