Hi,

I haven't had a chance to look into the code yet, so I am probably not
the best one to answer this question, but if it is changed from IP to
hostname, how would it help in my example scenario described a few
mails above?
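
Just so I'm sure we are talking about the same thing, here is how I
picture the difference in Java (class and method names are made up for
illustration only, this is not actual Nutch code):

import java.net.InetAddress;
import java.net.URL;

// Hypothetical illustration only -- not Nutch code. Shows the
// difference between keying the "one fetch at a time" restriction
// on the resolved IP address versus on the host name.
public class QueueKey {

    // Current behaviour as I understand it: two virtual hosts on the
    // same machine resolve to the same IP, so they share one queue.
    static String keyByIp(URL url) throws Exception {
        return InetAddress.getByName(url.getHost()).getHostAddress();
    }

    // Proposed behaviour: key on the host name, so each virtual host
    // gets its own queue even if they share a physical server.
    static String keyByHostName(URL url) {
        return url.getHost().toLowerCase();
    }
}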

As far as I understand this problem, there is no way a crawler can
learn what physical server structure sits behind a specific domain
name (or IP address) [not to mention that this can change dynamically
during fetch time, correct?]. On the other hand, it can very well
learn how fast the response is (and in fact this is the ONLY
information it can learn, correct?). Then it all comes down to the
question of how aggressive or polite the crawler should be. I believe
that big companies (like Google) must have a much more sophisticated
crawler system, because they never know what it is going to face.
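
For example, the only signal we can really measure is something like
the following (again a made-up sketch, not Nutch code; the smoothing
factor is just an arbitrary number):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep a smoothed response time per host,
// since response speed is the only thing the crawler can observe.
public class HostResponseTracker {

    private final Map<String, Double> avgMillis = new ConcurrentHashMap<>();
    private static final double ALPHA = 0.3; // smoothing factor (made up)

    // Call after every fetch with the measured download time.
    public void record(String host, long elapsedMillis) {
        avgMillis.merge(host, (double) elapsedMillis,
            (oldAvg, latest) -> (1 - ALPHA) * oldAvg + ALPHA * latest);
    }

    public double averageMillis(String host) {
        return avgMillis.getOrDefault(host, 0.0);
    }
}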

How about implementing new crawler behaviour that is more dynamically
driven, as opposed to a static_number_of_threads? Something that lets
me define a maximum number of threads in total and per host, plus some
factor specifying that if one host's/IP's response is *slow*, then the
number of concurrent threads for that host (or in total) is lowered.
[I know it won't be that easy... :-)]
This means more complicated analysis would be needed during the
fetching process, but I think it is worth it.
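
Something roughly like this, building on the response-time idea above
(just a made-up sketch to illustrate; the names, the threshold knob and
the scaling formula are all mine, not a proposed patch):

// Hypothetical sketch of the dynamic behaviour I have in mind: cap the
// threads per host and shrink the cap when the host's smoothed
// response time gets slow.
public class AdaptiveHostLimit {

    private final int maxThreadsPerHost;   // e.g. fetcher.threads.per.host
    private final long slowThresholdMs;    // "slow" cutoff, made-up knob

    public AdaptiveHostLimit(int maxThreadsPerHost, long slowThresholdMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.slowThresholdMs = slowThresholdMs;
    }

    // Allowed concurrency for a host given its smoothed response time:
    // full limit while the host is fast, dropping towards 1 as it slows.
    public int allowedThreads(double avgResponseMs) {
        if (avgResponseMs <= slowThresholdMs) {
            return maxThreadsPerHost;
        }
        int reduced = (int) (maxThreadsPerHost * slowThresholdMs / avgResponseMs);
        return Math.max(1, reduced);
    }
}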

Just my 2 cents.

Regards,
Lukas

On 9/21/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Doug Cutting wrote:
> > Lukas Vlcek wrote:
> >
> >> However, this leads me to the question what exactly
> >> fetcher.threads.per.host value is used for? More specifically what
> >> *host* means in Nutch configuration world?
> >
> >
> > In this case, a host is an IP address.
> 
> I've thought about this more, and wonder if perhaps this should be
> switched so that host names are blocked from simultaneous fetching rather
> than IP addresses.  I recently spoke with Carlos Castillo, author of the
> WIRE crawler (http://www.cwr.cl/projects/WIRE/) and it blocks hosts by
> name, not IP.  What do others think?
> 
> Doug
>
