Hi, I haven't had a chance to look into the code yet, so I am probably not the best one to answer this question, but if blocking is changed from IP to hostname, how would it help in my example scenario described a few mails above?
As far as I understand this problem, there is no way for the crawler to learn what physical server structure sits behind a specific domain name (or IP address) [not to mention that this can change dynamically during fetch time, correct?]. On the other hand, it can very well learn how fast the responses are (and in fact this is the ONLY information it can learn, correct?). So in the end it all comes down to how aggressive/polite the crawler should be. I believe that big companies (like Google) must have a much more sophisticated crawler system, because they never know what they are going to face.

How about implementing a new crawler behaviour which is more dynamically driven, as opposed to a static_number_of_threads? Something which allows me to define a maximum number of threads in total and per host, plus some factor saying that if one host/IP responds *slowly*, then the number of concurrent threads for that host (or in total) is lowered. [I know it won't be that easy... :-)] This means there would be a need for more analysis during the fetching process, but I think it is worth it. (There is a rough sketch of what I mean at the very end of this mail, below the quoted text.) Just my 2 cents.

Regards,
Lukas

On 9/21/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Doug Cutting wrote:
> > Lukas Vlcek wrote:
> >
> >> However, this leads me to the question what exactly the
> >> fetcher.threads.per.host value is used for? More specifically, what
> >> does *host* mean in the Nutch configuration world?
> >
> >
> > In this case, a host is an IP address.
>
> I've thought about this more, and wonder if perhaps this should be
> switched so that host names are blocked from simultaneous fetching rather
> than IP addresses. I recently spoke with Carlos Castillo, author of the
> WIRE crawler (http://www.cwr.cl/projects/WIRE/) and it blocks hosts by
> name, not IP. What do others think?
>
> Doug
>
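P.S. To make the idea above a bit more concrete, here is a rough sketch in plain Java of the kind of throttle I have in mind. None of this is real Nutch code; the class name, the thresholds and the moving-average factor are all just invented for illustration. The point is only that the number of concurrent fetches allowed against a host shrinks as that host's average response time grows:

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only -- not Nutch code. A fetcher thread would call
 * tryAcquire(host) before fetching and release(host, millis) afterwards;
 * slow hosts get fewer concurrent threads, but never fewer than one.
 */
public class AdaptiveHostThrottle {

    private static class HostStats {
        int active;             // fetches currently running against this host
        double avgMillis = -1;  // moving average of response time, -1 = unknown
    }

    private final int maxPerHost;   // hard upper bound, like fetcher.threads.per.host
    private final long slowMillis;  // responses slower than this count as "slow"
    private final Map<String, HostStats> stats = new HashMap<String, HostStats>();

    public AdaptiveHostThrottle(int maxPerHost, long slowMillis) {
        this.maxPerHost = maxPerHost;
        this.slowMillis = slowMillis;
    }

    /** Ask permission to start a fetch against the given host. */
    public synchronized boolean tryAcquire(String host) {
        HostStats s = stats.get(host);
        if (s == null) {
            s = new HostStats();
            stats.put(host, s);
        }
        if (s.active >= allowedThreads(s)) {
            return false;  // caller should back off or pick a URL from another host
        }
        s.active++;
        return true;
    }

    /** Report that a fetch finished and how long the host took to respond. */
    public synchronized void release(String host, long responseMillis) {
        HostStats s = stats.get(host);
        if (s == null) {
            return;
        }
        s.active--;
        // exponential moving average: smooth, but still reacts to a slowdown
        s.avgMillis = (s.avgMillis < 0)
            ? responseMillis
            : 0.8 * s.avgMillis + 0.2 * responseMillis;
    }

    /** The slower the host, the fewer concurrent threads it is allowed. */
    private int allowedThreads(HostStats s) {
        if (s.avgMillis < 0 || s.avgMillis <= slowMillis) {
            return maxPerHost;  // host is fast (or unknown): use the full budget
        }
        int reduced = (int) (maxPerHost * slowMillis / s.avgMillis);
        return Math.max(1, reduced);
    }
}

The total-thread limit would stay exactly as it is today; only the per-host (or per-IP, depending on how the question above is resolved) share would shrink for slow hosts.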
