In any case, I think the end goal would be to have per-host coefficients used when generating fetchlists. For example:

  maxPerHost = 1000
  superSlowMaxPerHost = maxPerHost * 0.1
  slowMaxPerHost = maxPerHost * 0.5
  avgMaxPerHost = maxPerHost * 1.0
  fastMaxPerHost = maxPerHost * 1.5

  perHostCounts = new HashMap(host, AtomicInteger)

  HostDatum hd = hostdb.get(host)
  dlSpeed = hd.downloadSpeed()
  hostCount = perHostCounts.get(host)

  if (dlSpeed < 100 && hostCount > superSlowMaxPerHost)
    // we have enough URLs for this host
    return/continue
  else if (dlSpeed < 200 && hostCount > slowMaxPerHost)
    // we have enough URLs for this host
    return/continue
  else if ....

  perHostCounts.put(host, hostCount + 1)
  emit URL

(A slightly more concrete Java sketch of this logic is appended below the
quoted message.)

Do others agree that the above is the goal and that it would help with
fetching efficiency by balancing the fetchlists better?

I believe the above would help us get away from fetchlists like:

  slow.com/1
  slow.com/2
  fast.com/1
  ...
  slow.com/N   (N is big)
  fast.com/2

And get fetchlists that are more like this (only a few URLs from slow sites
and more URLs from fast sites):

  slow.com/1
  slow.com/2
  fast.com/1
  fast.com/2
  fast.com/N   (N is big)

I made some HostDb progress last night, though I'm unsure what to do with
hostdb.get(host) other than to load all host data into memory in a MapReduce
job and do host lookups against that. Andrzej provided some pointers, but
reading those at 1-2 AM doesn't work...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 23, 2008 11:22:12 AM
> Subject: Re: Fetching inefficiency
>
> Hi,
>
>
> ----- Original Message ----
> > From: Andrzej Bialecki
> > To: [email protected]
> > Sent: Wednesday, April 23, 2008 4:23:44 AM
> > Subject: Re: Fetching inefficiency
> >
> > [EMAIL PROTECTED] wrote:
> > > Siddhartha,
> > >
> > >
> > > I think decreasing generate.max.per.host will limit the 'wait time'
> > > for each fetch run, but I have a feeling that the overall time will
> > > be roughly the same. As a matter of fact, it may be even higher,
> > > because you'll have to run generate more times, and if your fetch
> > > jobs are too short, you will be spending more time waiting on
> > > MapReduce jobs (JVM instantiation, job initialization....)
> >
> > That's correct in case of very short jobs. In case of longer jobs and
> > fetchlists consisting of many urls from the same hosts, the fetch time
> > will be dominated by 'wait time'.
> >
> > A different point of view on the effects of generate.max.per.host is
> > that it gives a better chance to smaller hosts to be included in a
> > fetchlist - otherwise fetchlists would be dominated by urls from large
> > hosts. So, in a sense it helps to differentiate your crawling frontier,
> > with a silent assumption that N pages from X hosts is more interesting
> > than the same N pages from a single host.
>
> Si, si!
> I think even the above assumes that you have so many pages
> that are ready to be fetched from large hosts, that if you let them all get
> into the fetchlist, there would be no room for sites with fewer pages.
> That is, it assumes -topN is being used and that N would be hit if you
> didn't limit per-host-URLs with generate.max.per.host.
>
> However, there is also an "in-between" situation, where you have this
> group of sites with lots of pages (some potentially slow), and sites with
> fewer pages (the pages-per-host distribution must have the "long tail"
> curve), but all together there are not enough of them to reach -topN.
>
> I think that in that case limiting with generate.max.per.host won't have the
> nice benefit of wider crawl frontier host distribution.... but this is
> really
> all
> theoretical. I am actually not hitting this issue.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
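
To make the per-host coefficient idea at the top of this message a bit more
concrete, here is a rough Java sketch of the counting/capping step inside the
generator. HostDatum and downloadSpeed() are carried over from the pseudocode
and are not an existing HostDb API; the class name, the speed cutoffs, and
their units are made-up placeholders. It is only meant to show the shape of
the logic, not a working patch.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of speed-based per-host caps during fetchlist generation.
 * The thresholds and the HostDb lookup are hypothetical placeholders.
 */
public class SpeedAwareHostLimiter {

  private final int maxPerHost;  // baseline cap, like generate.max.per.host
  private final Map<String, Integer> perHostCounts = new HashMap<String, Integer>();

  public SpeedAwareHostLimiter(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  // Scale the baseline cap by a coefficient derived from download speed.
  // The 100/200/400 cutoffs (and their units) are placeholders.
  private int capFor(double downloadSpeed) {
    if (downloadSpeed < 100) return (int) (maxPerHost * 0.1);  // super slow
    if (downloadSpeed < 200) return (int) (maxPerHost * 0.5);  // slow
    if (downloadSpeed < 400) return maxPerHost;                // average
    return (int) (maxPerHost * 1.5);                           // fast
  }

  // Called once per candidate URL during generation: true means emit the
  // URL into the fetchlist, false means this host already has enough URLs.
  public boolean shouldEmit(String host, double downloadSpeed) {
    int count = perHostCounts.containsKey(host) ? perHostCounts.get(host) : 0;
    if (count >= capFor(downloadSpeed)) {
      return false;
    }
    perHostCounts.put(host, count + 1);
    return true;
  }

  // Tiny demo: the slow host gets cut off early, the fast host keeps going.
  public static void main(String[] args) {
    SpeedAwareHostLimiter limiter = new SpeedAwareHostLimiter(1000);
    int slow = 0, fast = 0;
    for (int i = 0; i < 2000; i++) {
      if (limiter.shouldEmit("slow.com", 50))  slow++;   // capped at 1000 * 0.1 = 100
      if (limiter.shouldEmit("fast.com", 500)) fast++;   // capped at 1000 * 1.5 = 1500
    }
    System.out.println("slow.com: " + slow + " URLs, fast.com: " + fast + " URLs");
  }
}

Running the demo, slow.com is capped at 100 URLs while fast.com can take up
to 1500, which is the kind of fetchlist balance described at the top of this
message.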
