Alex,

Thank you for the answer. As for your last question - no, I don't own that site. I am looking for a specific type of information, and that is the first site I want to crawl.
Mark

On Mon, Nov 16, 2009 at 1:54 PM, Alex McLintock <alex.mclint...@gmail.com> wrote:

> 2009/11/16 Mark Kerzner <markkerz...@gmail.com>:
> > Hi,
> >
> > I want to politely crawl a site with 1-2 million pages. With a speed of
> > about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on
> > Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack?
>
> Nutch basically uses Hadoop - or an older version of Hadoop. So yes -
> it can run on a Hadoop-style cluster.
>
> I *think* the way it is split up will put only one site on one node,
> leaving you back at square one.
>
> However, I would say that 1 second per fetch is quite polite, and any
> faster is a bit rude. So I fail to see what you gain by using multiple
> machines...
>
> > I know that URLs from one domain are assigned to one fetch segment, and
> > polite crawling is enforced. Should I use lower-level parts of Nutch?
>
> Do you own the site being crawled?
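[For reference: the per-host politeness Alex mentions is governed by a few properties in conf/nutch-site.xml, which override the defaults in nutch-default.xml. A minimal sketch follows, assuming Nutch 1.x-era property names and a hypothetical agent name; check the nutch-default.xml shipped with your version before relying on these:]

<configuration>
  <!-- Identify the crawler to site operators; Nutch requires this to be set. -->
  <!-- "my-polite-crawler" is a placeholder value, not a standard name. -->
  <property>
    <name>http.agent.name</name>
    <value>my-polite-crawler</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- Allow only one fetch thread per host, so requests to the target
       site are never issued in parallel. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
</configuration>

[With these settings the crawl rate is bounded by the server delay no matter how large the cluster is: 2,000,000 pages x 1.5 s/page is about 3,000,000 s, roughly 35 days, which matches the "weeks" estimate and Alex's point that extra machines don't help against a single host.]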