Todd Lipcon wrote:
Hi all,
I've been recently working on a search application and running into trouble
sustaining a reasonable fetch rate. I have plenty of inbound bandwidth but
can't manage to configure the fetchers in such a way that I sustain more
than 20mbit or so inbound. My issue is that the distribution of pages
across hosts follows a Pareto distribution - the majority of pages are
held by a minority of hosts.
This is a common situation; your distribution is not unique.
I tried setting down the max.per.host configurations to a few thousand,
which helped distribute the crawl activity over the length of the job, but I
was still left with a lot of "straggler" hosts for the last few hours -
mostly hosts that had set their crawl delay much higher.
Did you try decreasing the max crawl delay? The Fetcher can drop items
whose crawl delay is too high, which could help.
I see the main issues as:
1) The queueing mechanism in Fetcher2 is a bit clunky since you can't
configure (afaik) the depth of the queue. Even at the start of the job I
have the vast majority of fetcher threads spinning even though there are
thousands of unique hosts left to fetch from.
2) There is no good way of dealing with disparity in crawl delay between
hosts. If there are 1000 URLs each from host A and host B in the crawl, but
host B has a 30 second crawl delay, the minimum job time is going to be
30,000 seconds.
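To make that bound concrete, here is a quick sketch of the arithmetic
(illustrative names, not Nutch code): with per-host politeness, a host's
fetches are serialized behind its crawl delay, so no amount of threads
helps.

```java
// Illustrative sketch (not Nutch code): a single host's crawl delay puts
// a hard lower bound on job time, regardless of fetcher thread count.
public class HostDelayBound {
    // Fetches from one host are serialized by its crawl delay, so that
    // host alone needs roughly urls * delay seconds.
    static long minJobSeconds(int urlsOnHost, long crawlDelaySeconds) {
        return (long) urlsOnHost * crawlDelaySeconds;
    }

    public static void main(String[] args) {
        // 1000 URLs on a host with a 30-second crawl delay:
        System.out.println(minJobSeconds(1000, 30)); // 30000
    }
}
```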
I'm working on #1 in my new Fetcher implementation (for which there's a
JIRA) and I'd propose solving #2 like this:
- Add a new configuration:
key: fetcher.abandon.queue.percentage
default: 0
description: once the number of unique hosts left in the fetch queue drops
below this percentage of (number of fetcher threads / fetchers per host),
abandon the remaining items by putting them in "retry 0" state
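A minimal sketch of the check this setting would imply (all names here are
illustrative; nothing like this exists in Fetcher2 today):

```java
// Illustrative sketch of the proposed fetcher.abandon.queue.percentage
// check; these names are hypothetical, not actual Nutch APIs.
public class AbandonCheck {
    // With N fetcher threads and k threads allowed per host, the fetcher
    // can keep at most N / k host queues busy at once. Once the unique
    // hosts remaining drop below the configured percentage of that
    // figure, the leftover items would be marked "retry 0".
    static boolean shouldAbandon(int uniqueHostsLeft, int fetcherThreads,
                                 int threadsPerHost, double abandonPercent) {
        double busyCapacity = (double) fetcherThreads / threadsPerHost;
        return uniqueHostsLeft < (abandonPercent / 100.0) * busyCapacity;
    }

    public static void main(String[] args) {
        // 100 threads, 1 per host, threshold 50%: abandon below 50 hosts.
        System.out.println(shouldAbandon(40, 100, 1, 50.0)); // true
        System.out.println(shouldAbandon(60, 100, 1, 50.0)); // false
    }
}
```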
With this strategy you need to add a mechanism to prevent Fetcher from
dropping the items even if the absolute remaining time is still
acceptable ... think about short queues and small fetchlists.
and (to help with both 1 and 2):
key: fetcher.queue.lookahead.ratio
default: 50 (from Fetcher2.java:875)
description: the maximum number of fetch items that will be queued ahead
of where the fetcher is currently working. By setting this value high, the
fetcher can do a better job of keeping all of its fetch threads busy at the
expense of more RAM. Estimated RAM usage is sizeof(CrawlDatum) * numfetchers
* lookaheadratio * some small overhead.
Current implementation (in Fetcher2) tries to do just that, see
QueueFeeder thread. Or did you have something else in mind?
My guess is that the size of a CrawlDatum is at worst 1KB, so with typical
values of fetch thread count <100, the default ratio of 50 allocates only
5MB of RAM for the fetch queue. I'd personally rather tune this up 20x or so
since Fetcher is otherwise not that RAM intensive.
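Spelling out that back-of-the-envelope estimate (a sketch of the
arithmetic above, not Nutch code):

```java
// Back-of-the-envelope memory estimate for the lookahead queue, per the
// formula sizeof(CrawlDatum) * numfetchers * lookaheadratio (overhead
// ignored). Illustrative only.
public class QueueRamEstimate {
    static long estimateBytes(long bytesPerDatum, int fetcherThreads,
                              int lookaheadRatio) {
        return bytesPerDatum * fetcherThreads * lookaheadRatio;
    }

    public static void main(String[] args) {
        // ~1 KB per CrawlDatum, 100 threads, default ratio 50 -> ~5 MB.
        System.out.println(estimateBytes(1024, 100, 50)); // 5120000
        // Tuning the ratio up 20x still only costs ~100 MB.
        System.out.println(estimateBytes(1024, 100, 1000)); // 102400000
    }
}
```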
Any thoughts on this? My end goal: given a segment with topN=1M and
max.per.host=10K, 100 fetcher threads, and crawl.delay = 1sec, I'd expect
the total fetch time (ignoring distribution) to be at most around
10,000 seconds, with the expected time much lower since most hosts won't
hit the max.per.host limit.
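That expectation comes from two lower bounds that both land at 10,000
seconds in this scenario. A sketch, under the simplifying assumption that
every thread is gated by the crawl delay (illustrative, not Nutch code):

```java
// Two rough lower bounds on segment fetch time (illustrative):
// 1) throughput: totalUrls / threads * delay, assuming each thread is
//    gated by the crawl delay of the queue it serves;
// 2) politeness: maxUrlsPerHost * delay for the fullest host.
public class FetchTimeBounds {
    static long boundSeconds(long totalUrls, int threads,
                             long maxPerHost, long crawlDelaySec) {
        long throughputBound = totalUrls / threads * crawlDelaySec;
        long perHostBound = maxPerHost * crawlDelaySec;
        return Math.max(throughputBound, perHostBound);
    }

    public static void main(String[] args) {
        // topN = 1M, 100 threads, max.per.host = 10K, crawl.delay = 1 s:
        System.out.println(boundSeconds(1_000_000, 100, 10_000, 1)); // 10000
    }
}
```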
Crawl delay is not the only factor that slows down the fetching. One
common issue is hosts that relatively quickly accept connections, but
then very slowly trickle the data byte by byte. Since we have per-host
queues it would make sense to monitor not only the req/sec rate but also
bytes/sec rate, and if it falls too low then terminate the queue.
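One way such a per-queue bytes/sec check could look (hypothetical names;
Fetcher2's queues don't track this today):

```java
// Hypothetical per-host-queue throughput monitor: alongside req/sec,
// track bytes/sec and terminate queues whose hosts trickle data too
// slowly. Names are illustrative, not Nutch APIs.
public class QueueThroughputMonitor {
    private final long startMillis;
    private long bytesReceived;

    QueueThroughputMonitor(long startMillis) {
        this.startMillis = startMillis;
    }

    void recordBytes(long n) {
        bytesReceived += n;
    }

    double bytesPerSec(long nowMillis) {
        long elapsedMs = nowMillis - startMillis;
        return elapsedMs <= 0 ? 0.0 : bytesReceived * 1000.0 / elapsedMs;
    }

    // True once the queue has outlived a grace period and its sustained
    // rate is below the threshold; the caller would then drop the queue.
    boolean shouldTerminate(long nowMillis, double minBytesPerSec,
                            long graceMillis) {
        if (nowMillis - startMillis < graceMillis) return false;
        return bytesPerSec(nowMillis) < minBytesPerSec;
    }

    public static void main(String[] args) {
        QueueThroughputMonitor m = new QueueThroughputMonitor(0);
        m.recordBytes(5_000);                 // 5 KB over 10 s = 500 B/s
        System.out.println(m.bytesPerSec(10_000));                 // 500.0
        System.out.println(m.shouldTerminate(10_000, 1024, 5_000)); // true
    }
}
```

The grace period matters: judging a queue on its first few seconds would kill hosts that are merely slow to ramp up, not pathologically slow.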
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com