Alex,

Thank you for the answer. As for your last question - no, I don't own that
site. I am looking for a specific type of information, and that is the first
site I want to crawl.

Mark

On Mon, Nov 16, 2009 at 1:54 PM, Alex McLintock <alex.mclint...@gmail.com> wrote:

> 2009/11/16 Mark Kerzner <markkerz...@gmail.com>:
> > Hi,
> >
> > I want to politely crawl a site with 1-2 million pages. At a speed of
> > about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on
> > Hadoop, and can I coordinate the crawlers so as not to cause a DoS
> > attack?
>
> Nutch basically uses Hadoop - or an older version of Hadoop. So yes -
> it can run on a Hadoop-style cluster.
>
> I *think* the way the crawl is split up will put all of one site's URLs
> on a single node, leaving you back at square one.
>
> However, I would say that 1 second per fetch is quite polite, and any
> faster is a bit rude. So I fail to see what you gain by using multiple
> machines...
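
A quick back-of-the-envelope check of the numbers above - assuming roughly
1.5 million pages and a 1.5-second gap between fetches (both mid-range
guesses, not figures from the thread):

    # Single-host politeness caps throughput no matter how many fetcher
    # nodes are added, since only one request at a time may hit the host.
    pages = 1500000   # assumed mid-range page count (thread says 1-2 million)
    delay = 1.5       # assumed seconds between fetches to the one host
    days = pages * delay / 86400.0
    print(days)       # roughly 26 days

That lines up with the "weeks" estimate, and adding machines does not change
it as long as the per-host delay is the binding constraint.
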
>
> > I know that URLs from one domain are assigned to one fetch segment, and
> > polite crawling is enforced. Should I use lower-level parts of Nutch?
>
> Do you own the site being crawled?
>
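
For what it's worth, below is a minimal sketch, in Python, of the per-host
politeness idea referred to above. It illustrates the concept only and is
not Nutch's actual code; in Nutch itself the relevant knobs are, if I
remember right, fetcher.server.delay and fetcher.threads.per.host in
nutch-site.xml.

    # Sketch (not Nutch code): each host gets its own "earliest next fetch"
    # time, so extra fetch threads or nodes never speed up a single host.
    # Locking is omitted for brevity.
    import time

    class PoliteScheduler:
        def __init__(self, delay=1.0):
            self.delay = delay       # seconds between requests to one host
            self.next_ok = {}        # host -> earliest allowed fetch time

        def wait_turn(self, host):
            now = time.time()
            ready = self.next_ok.get(host, now)
            if ready > now:
                time.sleep(ready - now)   # block until the host has cooled off
            self.next_ok[host] = time.time() + self.delay

Every fetcher thread would call wait_turn(host) before each request; with a
single target host, throughput is capped at one page per delay interval no
matter how large the cluster is.
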
