Mark Kerzner wrote:
Hi,
I want to politely crawl a site with 1-2 million pages. At a speed of
about 1-2 seconds per fetch, that will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so that together they don't amount to a DoS attack?
Your Hadoop cluster does not increase the scalability of the target
server, and that's the crux of the matter: whether you use Hadoop or
not, multiple threads or a single thread, if you want to be polite you
will be able to do only about 1 req/sec against that host, and that's it.
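To make that constraint concrete, here is a minimal sketch (class and method names are illustrative, not Nutch's API) of a per-host politeness gate: no matter how many threads or Hadoop tasks call it, each target host is granted at most one fetch per delay interval.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-host politeness gate. Regardless of how many workers
// share it, each host sees at most one request per delayMillis.
class PolitenessGate {
    private final long delayMillis;
    // host -> earliest time (ms) the next fetch may start
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Reserves the next fetch slot for `host` and returns how many
    // milliseconds the caller must wait before issuing the request.
    public synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

At a 1000 ms delay, 2 million pages on a single host work out to roughly 2,000,000 seconds, about 23 days, which is where the "weeks" estimate comes from.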
You can prioritize certain pages for fetching so that you get the most
interesting pages first (whatever "interesting" means).
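One way to sketch that prioritization (a hypothetical helper, not part of Nutch) is a fetch frontier ordered by a score, where the score encodes whatever "interesting" means for your crawl, e.g. inlink count or a PageRank-like value:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical frontier that yields the highest-scoring URL first.
class ScoredFrontier {
    public record ScoredUrl(String url, double score) {}

    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
        Comparator.comparingDouble(ScoredUrl::score).reversed());

    public void add(String url, double score) {
        queue.add(new ScoredUrl(url, score));
    }

    // Returns the most interesting remaining URL, or null when empty.
    public String next() {
        ScoredUrl u = queue.poll();
        return u == null ? null : u.url();
    }
}
```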
I know that URLs from one domain are assigned to one fetch segment, and
polite crawling is enforced there. Should I use lower-level parts of Nutch?
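For illustration, the host-to-segment assignment mentioned here can be sketched as hashing the hostname, so that all URLs from one domain land in the same segment and a single fetcher task can enforce the per-host delay. This is an illustration of the idea, not Nutch's actual partitioner:

```java
import java.net.URI;

// Illustrative partitioner: every URL from the same host maps to the
// same segment, so one fetcher task owns all traffic to that host.
class HostPartitioner {
    public static int segmentFor(String url, int numSegments) {
        String host = URI.create(url).getHost();
        // floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(host.hashCode(), numSegments);
    }
}
```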
The built-in limits are there to avoid causing pain for inexperienced
search engine operators (and for the webmasters who are their victims). The
source code is there; if you choose, you can modify it to bypass these
restrictions, just be aware of the consequences (and don't use "Nutch"
as your user agent ;) ).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com