In any case, I think the end goal would be to have per-host coefficients
used when generating fetchlists.  For example:


maxPerHost = 1000
superSlowMaxPerHost = maxPerHost * 0.1
slowMaxPerHost = maxPerHost * 0.5
avgMaxPerHost = maxPerHost * 1.0
fastMaxPerHost = maxPerHost * 1.5

// URLs already emitted per host for the fetchlist being generated
perHostCounts = new HashMap<String, AtomicInteger>()

// for each candidate URL:
HostDatum hd = hostdb.get(host)
dlSpeed = hd.downloadSpeed()
hostCount = perHostCounts.get(host)        // 0 if the host hasn't been seen yet
if (dlSpeed < 100 && hostCount > superSlowMaxPerHost)
  // we have enough URLs for this (super slow) host
  return/continue
else if (dlSpeed < 200 && hostCount > slowMaxPerHost)
  // we have enough URLs for this (slow) host
  return/continue
else if (...)
  // ... same check for the remaining (avg, fast) tiers ...

perHostCounts.get(host).incrementAndGet()  // i.e. hostCount++
emit URL

Do others agree that the above is the goal and that it would help with
fetching efficiency by balancing the fetchlists better?  I believe the above
would help us get away from fetchlists like:
slow.com/1
slow.com/2
fast.com/1
...
slow.com/N (N is big)
fast.com/2

And get fetchlists that are more like this (only a few URLs from slow sites
and more URLs from fast sites):

slow.com/1
slow.com/2
fast.com/1
fast.com/2
fast.com/N (N is big)
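
To make the idea above a bit more concrete, here is a rough, self-contained
Java version of the tier logic.  The class and method names and the boundary
between the "avg" and "fast" tiers are just assumptions for illustration (the
100 and 200 thresholds and HostDatum#downloadSpeed() come from the sketch
above); none of this is existing Nutch code:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal sketch of a per-host, speed-tiered cap for fetchlist generation. */
public class PerHostLimiter {

  // Base cap and per-tier coefficients (values are illustrative).
  private static final int MAX_PER_HOST = 1000;
  private static final int SUPER_SLOW_MAX = (int) (MAX_PER_HOST * 0.1);
  private static final int SLOW_MAX       = (int) (MAX_PER_HOST * 0.5);
  private static final int AVG_MAX        = (int) (MAX_PER_HOST * 1.0);
  private static final int FAST_MAX       = (int) (MAX_PER_HOST * 1.5);

  // URLs already accepted per host for the fetchlist being generated.
  private final Map<String, AtomicInteger> perHostCounts =
      new HashMap<String, AtomicInteger>();

  /** Decide whether one more URL for this host should go into the fetchlist.
   *  @param downloadSpeed past download speed for the host (e.g. KB/s),
   *         which would come from the proposed HostDatum#downloadSpeed(). */
  public boolean accept(String host, int downloadSpeed) {
    AtomicInteger count = perHostCounts.get(host);
    if (count == null) {
      count = new AtomicInteger(0);
      perHostCounts.put(host, count);
    }

    int cap;
    if (downloadSpeed < 100) {
      cap = SUPER_SLOW_MAX;        // super slow host: only a few URLs
    } else if (downloadSpeed < 200) {
      cap = SLOW_MAX;
    } else if (downloadSpeed < 400) {
      cap = AVG_MAX;               // 400 is a placeholder boundary
    } else {
      cap = FAST_MAX;              // fast host: allow more URLs
    }

    if (count.get() >= cap) {
      return false;                // we have enough URLs for this host
    }
    count.incrementAndGet();
    return true;                   // caller emits the URL
  }
}

The generator would call accept(host, dlSpeed) for each candidate URL and
only emit the URL when it returns true.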


I made some HostDb progress last night, though I'm unsure what
to do with hostdb.get(host) other than to load all host data into memory
in a MapReduce job and do host lookups against that.  Andrzej 
provided some pointers, but reading those at 1-2 AM doesn't work...
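
For the record, here is roughly what I had in mind for the
"load everything into memory" approach: read the whole hostdb into a HashMap
when the map task starts, then answer hostdb.get(host)-style lookups from
memory.  This assumes the hostdb is a SequenceFile of Text (hostname) ->
HostDatum, that HostDatum is a Writable with a copy constructor, and that its
location comes from a made-up "hostdb.path" property -- none of that exists
yet, it's just a sketch:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import org.apache.nutch.crawl.CrawlDatum;
// HostDatum is the proposed (not yet existing) per-host record.

public class HostAwareGeneratorMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private final Map<String, HostDatum> hostdb = new HashMap<String, HostDatum>();

  public void configure(JobConf job) {
    try {
      // "hostdb.path" is a hypothetical property naming the hostdb location.
      Path hostDbPath = new Path(job.get("hostdb.path"));
      FileSystem fs = FileSystem.get(job);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, hostDbPath, job);
      Text host = new Text();
      HostDatum datum = new HostDatum();
      while (reader.next(host, datum)) {
        // copy, since the reader reuses the same objects on each call
        hostdb.put(host.toString(), new HostDatum(datum));
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load hostdb into memory", e);
    }
  }

  public void map(Text url, CrawlDatum datum,
                  OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    String host = new java.net.URL(url.toString()).getHost();
    HostDatum hd = hostdb.get(host);
    int dlSpeed = (hd == null) ? 0 : hd.downloadSpeed();
    // ... apply the per-host coefficient checks from the sketch above ...
    output.collect(url, datum);
  }
}

An alternative that avoids holding everything in RAM would be to store the
hostdb as a MapFile and do MapFile.Reader#get() lookups per host, at the cost
of a disk seek per lookup.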

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 23, 2008 11:22:12 AM
> Subject: Re: Fetching inefficiency
> 
> Hi,
> 
> 
> ----- Original Message ----
> > From: Andrzej Bialecki 
> > To: [email protected]
> > Sent: Wednesday, April 23, 2008 4:23:44 AM
> > Subject: Re: Fetching inefficiency
> > 
> > [EMAIL PROTECTED] wrote:
> > > Siddhartha,
> > > 
> > 
> > > I think decreasing generate.max.per.host will limit the 'wait time'
> > > for each fetch run, but I have a feeling that the overall time will
> > > be roughly the same.  As a matter of fact, it may be even higher,
> > > because you'll have to run generate more times, and if your fetch
> > > jobs are too short, you will be spending more time waiting on
> > > MapReduce jobs (JVM instantiation, job initialization....)
> > 
> > That's correct in case of very short jobs. In case of longer jobs and 
> > fetchlists consisting of many urls from the same hosts, the fetch time 
> > will be dominated by 'wait time'.
> > 
> > A different point of view on the effects of generate.max.per.host is 
> > that it gives a better chance to smaller hosts to be included in a 
> > fetchlist - otherwise fetchlists would be dominated by urls from large 
> > hosts. So, in a sense it helps to differentiate your crawling frontier, 
> > with a silent assumption that N pages from X hosts is more interesting 
> > than the same N pages from a single host.
> 
> Si, si!
> I think even the above assumes that you have so many pages ready to be
> fetched from large hosts that, if you let them all get into the fetchlist,
> there would be no room for sites with fewer pages.
> That is, it assumes -topN is being used and that N would be hit if you
> didn't limit per-host-URLs with generate.max.per.host.
> 
> However, there is also an "in-between" situation, where you have this
> group of sites with lots of pages (some potentially slow), and sites with
> fewer pages (the pages-per-host distribution must have the "long tail"
> curve), but all together there are not enough of them to reach -topN.
> 
> 
> I think that in that case limiting with generate.max.per.host won't have the
> nice benefit of a wider crawl frontier host distribution... but this is really
> all theoretical.  I am actually not hitting this issue.
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
