That's really interesting. From the literature I have been reading,
I always thought that politeness policies required a 1s - 5s wait
between queries to the same host, regardless of whether HTTP/1.0 or
HTTP/1.1 was used. If that's not the case, do we really want to add
the logic of determining whether this is a
"num-sites-per-minute-host" or a "num-sites-per-hour-host"
Actually it's a num-requests-per-xxx setting, where typically 1
request == 1 web page, if you exclude things like PDFs.
or a "1-request-per-second-host"? If we assume the former options
then we will definitely increase our throughput
The throughput would depend on what setting you use. E.g. if the
setting was 20 pages/minute, this would be roughly the equivalent
(leaving aside latency) of a 3 second delay between requests.
but most likely get a lot of hate mail from webmasters of smaller
sites, and if we assume the latter then we are back where we started.
The small number - 5 or so - of ops people I surveyed _all_ cared
about requests per time interval, not the time delay between
requests. And having gone through the experience of getting a bunch
of pages crawled on our site, I understand why.
Nobody that I know of monitors individual request timing - it's all
based on summed, time-bucketed info for an originating IP address, or
block of addresses.
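To make that concrete, the sort of accounting I mean looks roughly
like this (an illustrative sketch with invented names, not pulled
from any real monitoring tool):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: count requests per originating IP in
// one-minute buckets. This matches how the ops folks I talked to
// evaluate crawler traffic - summed counts per time bucket, not the
// gaps between individual requests.
public class RequestBuckets {
  private final ConcurrentHashMap<String, AtomicInteger> counts =
    new ConcurrentHashMap<String, AtomicInteger>();

  public void record(String sourceIp, long timestampMs) {
    String key = sourceIp + "@" + (timestampMs / 60000L);
    AtomicInteger n = counts.get(key);
    if (n == null) {
      AtomicInteger fresh = new AtomicInteger();
      n = counts.putIfAbsent(key, fresh);
      if (n == null) {
        n = fresh;
      }
    }
    n.incrementAndGet();
  }

  // "Is this crawler polite?" then becomes a question about these sums.
  public int requestsInMinute(String sourceIp, long timestampMs) {
    AtomicInteger n = counts.get(sourceIp + "@" + (timestampMs / 60000L));
    return n == null ? 0 : n.get();
  }
}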
So my main point is that currently it feels like there's a mismatch
between how Nutch tries to implement polite crawling and how an ops
person would evaluate the politeness of a crawler.
A secondary point is that it should be possible to use keep-alive,
and still be considered polite, if ops people really did care about
requests/time interval.
Which could lead to significantly more efficient crawling. For
example, see page 6 of "Scheduling Algorithms for Web Crawling",
http://www.dcc.uchile.cl/~ccastill/papers/castillo04_scheduling_algorithms_web_crawling.pdf.
In their tests, with a big pipe, the crawl was 3x faster...though it
doesn't say how they constrained the per-domain crawling rate to
remain polite.
-- Ken
PS - Also note that the HTTP 1.1 response header can contain the
server's max requests/connection value, so the site admin does have
control over what they feel is a reasonable limit.
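For example, a server with keep-alive enabled will typically respond
with something like:

HTTP/1.1 200 OK
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100

where "max" is the number of requests the server will allow on that
connection before closing it (the exact values above are made up -
they depend on the server configuration).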
On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote:
Ned Rockson wrote:
So recently I switched to Fetcher2 over Fetcher for larger whole
web fetches (50-100M at a time). I found that the URLs generated
are not optimal because they are simply randomized by a hash
comparator. In one crawl on 24 machines it took about 3 days to
crawl 30M URLs. Compared with old benchmarks I had set with the
regular Fetcher.java, this was at least 3-fold more time.
Anyway, I realized that a good ordering can be approximated by
randomization, but to get optimal ordering, URLs from the same host
should be as far apart in the list as possible. So I wrote a series
of 2 map/reduces to optimize the ordering, and for a list of 25M
documents it takes about 10 minutes on our cluster. Right now I have
it in its own class, but I figured it could go in Generator.java,
with a flag in nutch-default.xml determining whether the user wants
to use it.
So, should I submit the code by email? Is there some way to
change Generator.java or should I just submit the function in its
own class?
This sounds intriguing. Nutch mailing lists strip attachments -
you should create a JIRA issue, copy this description, and attach
a patch in unified diff format (svn diff will do).
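For anyone else following the thread, my reading of the idea - in
rough, non-MapReduce form, and obviously not the actual patch - would
be something like:

import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Rough sketch of the ordering idea: group URLs by host, then deal
// them out round-robin so that URLs from the same host end up as far
// apart in the fetch list as possible.
public class HostInterleaver {
  public static List<String> interleaveByHost(List<String> urls) throws Exception {
    Map<String, Queue<String>> byHost = new LinkedHashMap<String, Queue<String>>();
    for (String url : urls) {
      String host = new URL(url).getHost();
      Queue<String> q = byHost.get(host);
      if (q == null) {
        q = new LinkedList<String>();
        byHost.put(host, q);
      }
      q.add(url);
    }
    List<String> ordered = new ArrayList<String>(urls.size());
    while (!byHost.isEmpty()) {
      Iterator<Queue<String>> it = byHost.values().iterator();
      while (it.hasNext()) {
        Queue<String> q = it.next();
        ordered.add(q.poll());
        if (q.isEmpty()) {
          it.remove();
        }
      }
    }
    return ordered;
  }
}

The real version presumably does the grouping in the first map/reduce
and the round-robin dealing in the second, but the end result should
be the same kind of interleaved list.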
For what it's worth, a year ago I had a series of discussions with
people who manage large web sites, w.r.t. acceptable crawl rates.
This was in the context of whether using keep-alive made sense, to
optimize crawl performance.
For sites that mostly serve up static content, keep-alive was seen
as a good thing, in that it reduced the load on the server from
constantly creating/destroying connections.
For sites that build pages from multiple data sources, this was a
minor issue - they were more concerned about total # of pages
fetched per time interval, given the high cost of assembling the
page.
Which is the interesting point - nobody talked about the amount of
time between requests. They were all interested in monitoring pages
per time interval - typically pages per hour for smaller sites, and
pages per minute for ones with heavier traffic.
Based on the above, what seems like the most efficient approach to
fetching would be to use a pages/minute/domain restriction, and use
a kept-alive connection (w/HTTP 1.1) with no delay until that limit
is reached, then switch to another set of URLs from a different
domain.
This assumes a maximum of one thread/domain, and that URLs are
segmented so that each domain is handled inside of one task, which
lets the one thread/domain and pages/minute/domain restrictions be
enforced properly.
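Very roughly, the per-task loop I'm picturing is something like this
(all names invented - this isn't Fetcher2 code):

import java.util.Queue;

// Illustrative sketch only. One thread owns one domain at a time,
// fetches over a single kept-alive connection with no delay, and
// hands the domain back once the per-minute budget is spent.
public class PerDomainFetchLoop {

  private static final int PAGES_PER_MINUTE_PER_DOMAIN = 20;  // assumed setting

  // Stand-in for whatever HTTP client the fetcher actually uses.
  public interface HttpClientStub {
    void fetch(String url);
  }

  public void fetchDomain(Queue<String> urls, HttpClientStub client) {
    long windowStart = System.currentTimeMillis();
    int fetchedInWindow = 0;
    while (!urls.isEmpty()) {
      long now = System.currentTimeMillis();
      if (now - windowStart >= 60000L) {   // a new one-minute window
        windowStart = now;
        fetchedInWindow = 0;
      }
      if (fetchedInWindow >= PAGES_PER_MINUTE_PER_DOMAIN) {
        return;  // budget spent - switch to a different domain's URLs
      }
      client.fetch(urls.poll());           // same kept-alive connection, no delay
      fetchedInWindow++;
    }
  }
}

The point being that politeness is enforced as a budget per time
window - which is what the ops people actually measure - while the
connection itself stays open and busy.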
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"