You know, OC/nutch-84 already provides these mechanisms via the 
DefaultFetchList class:

1. Block by hostname
2. Configurable wait time based on the time taken to download.
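
To make the second point concrete, here is a rough sketch of the kind of adaptive delay discussed in the quoted message below. It is not the DefaultFetchList code. The class name AdaptiveDelay, the doubling/halving step, the min/max bounds, and measuring the threshold against the configured base delay are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): per-host delay that adapts to response time. */
class AdaptiveDelay {
    private final long baseDelayMs;  // configured delay, e.g. 2000 ms
    private final double fraction;   // threshold as a fraction of the base delay, e.g. 0.5
    private final long minDelayMs;   // assumed lower bound on the delay
    private final long maxDelayMs;   // assumed upper bound on the delay
    private final Map<String, Long> delayByHost = new HashMap<>();

    AdaptiveDelay(long baseDelayMs, double fraction, long minDelayMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.fraction = fraction;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay currently applied to this host; hosts not seen yet get the base delay. */
    long currentDelay(String host) {
        return delayByHost.getOrDefault(host, baseDelayMs);
    }

    /** Record the outcome of one request and return the host's updated delay. */
    long record(String host, long elapsedMs, int httpStatus) {
        long delay = currentDelay(host);
        long threshold = (long) (baseDelayMs * fraction);
        if (httpStatus == 503 || elapsedMs > threshold) {
            // Slow or overloaded server: back off (doubling is an assumption).
            delay = Math.min(maxDelayMs, delay * 2);
        } else if (elapsedMs < threshold) {
            // Fast server: fetch more eagerly (halving is an assumption).
            delay = Math.max(minDelayMs, delay / 2);
        }
        delayByHost.put(host, delay);
        return delay;
    }
}
```

With a base delay of 2000 ms and a fraction of 0.5, a host that answers in under a second has its delay halved, and one that takes longer than a second or returns 503 has it doubled, within the configured bounds.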

And this is a good example: if Matthias' requirements are unique, he can 
always implement a new FetchList which blocks by IP. There is no point in 
trying to please everyone.

In Nutch-speak, I guess the FetchList has to be an extension point.
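
As a rough illustration of why this fits an extension point: the only thing that differs between blocking by hostname and blocking by IP is the key used to group URLs. The sketch below is not Nutch code. BlockingKeys, byHost, byIp and group are hypothetical names, and falling back to the hostname when DNS resolution fails is an assumption.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch (not Nutch classes): pluggable keys for per-server blocking. */
class BlockingKeys {

    /** Key by hostname, so every subdomain gets its own queue. */
    static String byHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Key by resolved IP address, so virtual hosts on one server share a queue. */
    static String byIp(String url) {
        String host = byHost(url);
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return host; // assumption: fall back to the hostname if DNS fails
        }
    }

    /** Group URLs into per-key lists; a fetcher would then rate-limit each list separately. */
    static Map<String, List<String>> group(List<String> urls, Function<String, String> key) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(key.apply(url), k -> new ArrayList<>()).add(url);
        }
        return queues;
    }
}
```

Swapping BlockingKeys::byHost for BlockingKeys::byIp in the call to group changes the blocking policy without touching the rest of the fetch logic, which is the sort of thing a pluggable FetchList would allow.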

k

On Wed, 21 Sep 2005 21:07:28 +0200, Matthias Jaekle wrote:
>> So most other crawlers use the hostname, not the ip.  That's good
>> to know.
>
> Google and Yahoo, yes. The others I am not sure about.
>
>> Perhaps a dynamic property would help.  If the elapsed time of
>> the previous request is some fraction of the delay then we might
>> lessen the delay.  Similarly, if it is greater or if we get 503s,
>> then we might increase it.  For example, if the fraction were .5
>> and the delay is 2 seconds, then sites which respond faster than
>> a second would get their delay decreased, and sites which respond
>> in more than a second or that return 503 would have their delay
>> increased.  Do you think this would be effective with your site?
>>
>
> Adjusting the amount of downloads dynamically according to the
> response time should be great.
>
> But where is the advantage doing this per unique name?
>
> If there is no real reason to do so, I would do it dynamically per
> IP or second level domain, but not per sub domain.
>
> Matthias

