You know, OC/nutch-84 already provides these mechanisms via the
DefaultFetchList class:

1. Block by hostname
2. Configurable wait time, based on the time taken to download

And this is a good example where, if Matthias' requirements are unique,
he can always implement a new FetchList which blocks by IP. No point
trying to please everyone. In Nutch-speak, I guess the FetchList has to
be an extension point.

k

On Wed, 21 Sep 2005 21:07:28 +0200, Matthias Jaekle wrote:
>> So most other crawlers use the hostname, not the ip. That's good
>> to know.
>
> google and yahoo, Yes. The others I am not sure.
>
>> Perhaps a dynamic property would help. If the elapsed time of
>> the previous request is some fraction of the delay then we might
>> lessen the delay. Similarly, if it is greater or if we get 503s,
>> then we might increase it. For example, if the fraction were .5
>> and the delay is 2 seconds, then sites which respond faster than
>> a second would get their delay decreased, and sites which respond
>> in more than a second or that return 503 would have their delay
>> increased. Do you think this would be effective with your site?
>
> Adjusting the amount of downloads dynamically according to the
> response time should be great.
>
> But where is the advantage doing this per unique name?
>
> If there is no real reason to do so, I would do it dynamically per
> IP or second level domain, but not per sub domain.
>
> Matthias
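
P.S. For what it's worth, here is a rough sketch in Java of the dynamic delay rule quoted above (a fraction of .5 with a 2 second delay gives a 1 second threshold). It is not code from Nutch or OC: the class name AdaptiveDelay, the multiplicative step, and the min/max bounds are all assumptions made for illustration, since the thread does not say by how much the delay should change.

```java
// Hypothetical helper, not part of Nutch or OC: adjusts a per-host (or per-IP)
// fetch delay based on how long the previous request took.
final class AdaptiveDelay {
    private final double fraction;  // threshold = current delay * fraction
    private final long minDelayMs;  // assumed lower bound on the delay
    private final long maxDelayMs;  // assumed upper bound on the delay
    private final double step;      // assumed multiplicative step (> 1)

    AdaptiveDelay(double fraction, long minDelayMs, long maxDelayMs, double step) {
        if (fraction <= 0 || step <= 1 || minDelayMs <= 0 || maxDelayMs < minDelayMs) {
            throw new IllegalArgumentException("invalid adaptive delay settings");
        }
        this.fraction = fraction;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.step = step;
    }

    // Returns the delay to use before the next request to the same site.
    long next(long currentDelayMs, long elapsedMs, int httpStatus) {
        long thresholdMs = Math.round(currentDelayMs * fraction);
        double adjusted = currentDelayMs;
        if (httpStatus == 503 || elapsedMs > thresholdMs) {
            // Slow or overloaded site: back off.
            adjusted = currentDelayMs * step;
        } else if (elapsedMs < thresholdMs) {
            // Fast site: fetch a little more often.
            adjusted = currentDelayMs / step;
        }
        // Clamp so the delay never collapses to zero or grows without limit.
        return Math.max(minDelayMs, Math.min(maxDelayMs, Math.round(adjusted)));
    }
}
```

With new AdaptiveDelay(0.5, 500, 30000, 2.0) and a current delay of 2000 ms, a 400 ms response halves the delay and a 1500 ms response or a 503 doubles it. Whether the delay is tracked per hostname, per IP, or per second level domain only depends on which key the caller stores it under.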
