2009/11/16 Mark Kerzner <markkerz...@gmail.com>:
> Hi,
>
> I want to politely crawl a site with 1-2 million pages. With the speed of
> about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
> and can I coordinate the crawlers so as not to cause a DoS attack?

Nutch is built on Hadoop (though it may bundle an older Hadoop release). So yes,
it can run on a Hadoop-style cluster.

I *think* the fetch lists are partitioned by host, so all the pages from
one site end up on a single node, leaving you back at square one.
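
For illustration, here is a minimal sketch of host-based partitioning. It
is not Nutch's actual partitioner (the real class and hashing details
differ by version); it only shows why every URL from one host lands on the
same fetcher task:

    import java.net.URL;

    public class HostPartitionSketch {
        // Hypothetical helper: choose a partition by hashing the URL's host,
        // so all URLs from one host map to the same fetcher task.
        static int partitionFor(String url, int numPartitions) throws Exception {
            String host = new URL(url).getHost();
            // Mask the sign bit instead of Math.abs (which overflows on MIN_VALUE)
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            // Same host, so both lines print the same partition number
            System.out.println(partitionFor("http://example.com/a", 10));
            System.out.println(partitionFor("http://example.com/b", 10));
        }
    }

So even with ten fetch nodes, a single-site crawl runs on just one of them.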

However, I would say that 1 second per fetch is quite polite, and anything
faster is a bit rude. So I fail to see what you would gain by using
multiple machines...
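
To put numbers on it: 2,000,000 pages at 1 s/page is 2,000,000 s, about 23
days; at 2 s/page it is about 46 days, regardless of cluster size, because
the per-host delay serializes the whole crawl.

If you do want to adjust the delay itself, Nutch 1.x reads it from
conf/nutch-site.xml via the fetcher.server.delay property; a sketch (the
value here is just an example, check nutch-default.xml for your version's
default):

    <property>
      <name>fetcher.server.delay</name>
      <!-- seconds to wait between successive requests to the same host -->
      <value>2.0</value>
    </property>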

> I know that URLs from one domain are assigned to one fetch segment, and
> polite crawling is enforced. Should I use lower-level parts of Nutch?

Do you own the site being crawled?
