Hi Todd,

This sounds good. I think we've all seen the problem you are describing.
You can see something related at:

- https://issues.apache.org/jira/browse/NUTCH-629
- https://issues.apache.org/jira/browse/NUTCH-628
It would be great if you could incorporate any of the good ideas from the
above into your work.

Otis

P.S. Which JIRA issue are you referring to? https://issues.apache.org/jira/browse/NUTCH-669 ? +1!

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Todd Lipcon <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, December 10, 2008 12:13:15 PM
> Subject: proposal: fetcher performance improvements
>
> Hi all,
>
> I've recently been working on a search application and running into trouble
> sustaining a reasonable fetch rate. I have plenty of inbound bandwidth but
> can't manage to configure the fetchers in such a way that I sustain more
> than 20mbit or so inbound. My issue is that the distribution across hosts
> follows a Pareto distribution - the majority of pages are held by the
> minority of hosts.
>
> I tried setting the max.per.host configuration down to a few thousand,
> which helped distribute the crawl activity over the length of the job, but I
> was still left with a lot of "straggler" hosts for the last few hours -
> mostly hosts that had set their crawl delay much higher.
>
> I see the main issues as:
>
> 1) The queueing mechanism in Fetcher2 is a bit clunky since you can't
> configure (afaik) the depth of the queue. Even at the start of the job I
> have the vast majority of fetcher threads spinning even though there are
> thousands of unique hosts left to fetch from.
>
> 2) There is no good way of dealing with disparity in crawl delay between
> hosts. If there are 1000 URLs from host A and host B in the crawl, but host
> B has a 30 second crawl delay, the minimum job time is going to be 30000
> seconds.
>
> I'm working on #1 in my new Fetcher implementation (for which there's a
> JIRA) and I'd propose solving #2 like this:
>
> - Add a new configuration:
>   key: fetcher.abandon.queue.percentage
>   default: 0
>   description: once the number of unique hosts left in the fetch queue drops
>   below this percentage of the number of fetcher threads divided by the
>   number of fetchers per host, abandon the remaining items by putting them
>   in "retry 0" state
>
> and (to help with both 1 and 2):
>   key: fetcher.queue.lookahead.ratio
>   default: 50 (from Fetcher2.java:875)
>   description: the maximum number of fetch items that will be queued ahead
>   of where the fetcher is currently working. By setting this value high, the
>   fetcher can do a better job of keeping all of its fetch threads busy at
>   the expense of more RAM. Estimated RAM usage is sizeof(CrawlDatum) *
>   numfetchers * lookaheadratio * some small overhead.
>
> My guess is that the size of a CrawlDatum is at worst 1KB, so with typical
> fetch thread counts of <100, the default ratio of 50 allocates only
> 5MB of RAM for the fetch queue. I'd personally rather tune this up 20x or so
> since Fetcher is otherwise not that RAM intensive.
>
> Any thoughts on this? My end goal: given a segment with topN=1M and
> max.per.host=10K, 100 fetcher threads, and crawl.delay = 1sec, I'd expect
> the total fetch time (ignoring distribution) to be at most in the range of
> 10,000 seconds, with the expected time much lower since most hosts won't hit
> the max.per.host limit.
>
> Any thoughts? Has anyone else found the sweet spot for configuration when
> inbound bandwidth and disk throughput are not an issue?
>
> Thanks
> -Todd
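To make the lookahead proposal concrete, here is a minimal sketch of how a
configurable fetcher.queue.lookahead.ratio could bound the feeder's in-memory
queue. The property name is the one proposed in the email (not an existing
Nutch key), and the feeder/queue shapes below are simplified stand-ins rather
than the actual Fetcher2 classes, which currently hard-code the equivalent
factor of 50.

    // LookaheadFeederSketch.java - illustrative only, not Fetcher2 code.
    import java.util.Iterator;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.hadoop.conf.Configuration;

    public class LookaheadFeederSketch {

      /** Placeholder for a queued fetch item (host + CrawlDatum in real Fetcher2). */
      static class FetchItem {
        final String url;
        FetchItem(String url) { this.url = url; }
      }

      public static void main(String[] args) throws InterruptedException {
        Configuration conf = new Configuration();
        int threadCount = conf.getInt("fetcher.threads.fetch", 10);
        // Proposed key; Fetcher2 currently uses a fixed lookahead factor of 50.
        int lookaheadRatio = conf.getInt("fetcher.queue.lookahead.ratio", 50);

        // The feeder never queues more than threads * ratio items ahead of the
        // fetch threads, which caps RAM at roughly sizeof(item) * threads * ratio.
        int maxQueuedItems = threadCount * lookaheadRatio;
        BlockingQueue<FetchItem> queue = new LinkedBlockingQueue<>(maxQueuedItems);

        Iterator<String> segmentUrls = java.util.List.of(
            "http://a.example/1", "http://b.example/1").iterator();

        while (segmentUrls.hasNext()) {
          // put() blocks once the lookahead bound is reached, so the feeder
          // throttles itself instead of loading the whole segment into RAM.
          queue.put(new FetchItem(segmentUrls.next()));
        }
        System.out.println("queued " + queue.size() + " of max " + maxQueuedItems);
      }
    }

Raising the ratio then trades a known, bounded amount of heap for a better
chance that every fetch thread always has a polite host available to work on.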

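A quick back-of-envelope check of the numbers quoted in the proposal, using
only the email's own assumptions (~1KB per CrawlDatum, 100 fetch threads,
lookahead ratio 50, max.per.host=10K, crawl.delay=1s); purely illustrative
arithmetic, not measured figures.

    // FetcherEstimates.java - reproduces the RAM and fetch-time estimates.
    public class FetcherEstimates {
      public static void main(String[] args) {
        int threadCount = 100;          // fetch thread count
        int lookaheadRatio = 50;        // proposed default
        int crawlDatumBytes = 1024;     // worst-case guess from the email

        long queueRamBytes = (long) threadCount * lookaheadRatio * crawlDatumBytes;
        System.out.printf("queue RAM ~ %.1f MB%n", queueRamBytes / (1024.0 * 1024.0));
        // ~4.9 MB, so tuning the ratio up 20x still stays around 100 MB.

        int maxPerHost = 10_000;        // max.per.host
        int crawlDelaySeconds = 1;      // per-host politeness delay
        // With one fetch per host per delay interval, the slowest host that
        // hits the cap bounds the job at roughly:
        int worstCaseSeconds = maxPerHost * crawlDelaySeconds;
        System.out.println("worst-case fetch time ~ " + worstCaseSeconds + " s");
      }
    }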