Hi,
----- Original Message ----
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 23, 2008 4:23:44 AM
> Subject: Re: Fetching inefficiency
>
> [EMAIL PROTECTED] wrote:
> > Siddhartha,
> >
> > I think decreasing generate.max.per.host will limit the 'wait time'
> > for each fetch run, but I have a feeling that the overall time will
> > be roughly the same. As a matter of fact, it may be even higher,
> > because you'll have to run generate more times, and if your fetch
> > jobs are too short, you will be spending more time waiting on
> > MapReduce jobs (JVM instantiation, job initialization....)
>
> That's correct in case of very short jobs. In case of longer jobs and
> fetchlists consisting of many urls from the same hosts, the fetch time
> will be dominated by 'wait time'.
>
> A different point of view on the effects of generate.max.per.host is
> that it gives a better chance to smaller hosts to be included in a
> fetchlist - otherwise fetchlists would be dominated by urls from large
> hosts. So, in a sense it helps to differentiate your crawling frontier,
> with a silent assumption that N pages from X hosts is more interesting
> than the same N pages from a single host.

Si, si! I think even the above assumes that you have so many pages
ready to be fetched from large hosts that, if you let them all into the
fetchlist, there would be no room for sites with fewer pages. That is,
it assumes -topN is being used and that N would be hit if you didn't
limit per-host URLs with generate.max.per.host.

However, there is also an "in-between" situation, where you have a
group of sites with lots of pages (some potentially slow) and sites
with fewer pages (the pages-per-host distribution must have the "long
tail" curve), but all together there are not enough of them to reach
-topN. I think that in that case limiting with generate.max.per.host
won't have the nice benefit of a wider crawl frontier host
distribution... but this is really all theoretical. I am actually not
hitting this issue.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
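
P.S. A rough sketch of the two cases, in case it helps make the argument concrete. This is a toy selection model, not Nutch's actual Generator logic: build_fetchlist, make_candidates, the host names and the numbers are all made up for illustration, and the large host's URLs are simply assumed to sort first.

```python
from collections import Counter


def build_fetchlist(candidates, top_n, max_per_host=None):
    """Toy model: walk candidates in order, skip hosts that reached the
    per-host cap, and stop once top_n URLs have been selected."""
    per_host = Counter()
    selected = []
    for host, url in candidates:
        if len(selected) >= top_n:
            break
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        selected.append((host, url))
    return selected


def make_candidates(big_pages, small_hosts, pages_per_small_host):
    """One large host (assumed to sort first) plus a tail of small hosts."""
    cands = [("big.example", f"http://big.example/{i}") for i in range(big_pages)]
    for h in range(small_hosts):
        host = f"small{h}.example"
        cands += [(host, f"http://{host}/{i}") for i in range(pages_per_small_host)]
    return cands


def distinct_hosts(fetchlist):
    return len({host for host, _ in fetchlist})


# 1000 URLs from one large host + 50 small hosts with 2 URLs each = 1100 URLs
cands = make_candidates(big_pages=1000, small_hosts=50, pages_per_small_host=2)

# Case 1: topN is the binding limit (500 < 1100 candidates)
tight_uncapped = build_fetchlist(cands, top_n=500)
tight_capped = build_fetchlist(cands, top_n=500, max_per_host=100)

# Case 2: the "in-between" case, topN is never reached (5000 > 1100 candidates)
loose_uncapped = build_fetchlist(cands, top_n=5000)
loose_capped = build_fetchlist(cands, top_n=5000, max_per_host=100)

for name, fl in [("tight/uncapped", tight_uncapped), ("tight/capped", tight_capped),
                 ("loose/uncapped", loose_uncapped), ("loose/capped", loose_capped)]:
    print(f"{name}: {len(fl)} urls from {distinct_hosts(fl)} hosts")
```

In this model the cap widens the host mix only when topN would otherwise be hit (1 host vs. 51). When topN is not reached, both fetchlists already cover all 51 hosts, and the cap only shortens the list.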
