Kelvin Tan wrote:
Seeing that mapred is about to be folded into trunk, three questions:
1. Any benchmarks/estimates on when the scalability of map-reduce surpasses its
overhead/complexity? E.g., with more than 10 reduce workers?
I think that with as few as two boxes it will outperform uniprocessor
Nutch. This will not be true for very small collections, where the
overhead of starting JVMs can dominate. (MapReduce runs each task in a
separate JVM, for robustness.)
2. Will there be an option of a plain vanilla single-box Nutch crawler vs a
map-reduce version?
Yes, there already is. By default, MapReduce runs on a single box in a
single JVM. To run on multiple boxes in multiple JVMs, one must alter
the default configuration (to name the jobtracker server) and start the
jobtracker daemon and one or more tasktracker daemons. There are shell
scripts to assist with managing these daemons.
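For example, something like this (a minimal sketch; mapred.job.tracker
and the daemon script are from the mapred branch, but the host and port
below are just placeholders):

  <!-- in nutch-site.xml: name the jobtracker instead of the
       default value of "local" -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8010</value>
  </property>

  # on the master, then on each slave:
  bin/nutch-daemon.sh start jobtracker
  bin/nutch-daemon.sh start tasktracker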
3. What are the options for users who don't want to jump on board with
map-red? Will pre-mapred be actively maintained?
The MapReduce versions of Nutch tools (inject, generate, fetch, etc.) are
not a large amount of code. One could easily build compatible
non-MapReduce versions of these tools. But then we'd be maintaining two
versions, so we should avoid this as much as possible. However, if some
applications need a very different control flow, that might warrant it.
For example, one might write a crawler that combines a number of these
tools in a single process, e.g., using an RDBMS to keep track of URLs
and links while crawling, updating the database in real time as URLs are
fetched. Such an architecture would not be as scalable, but might excel
in other ways.
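To make that concrete, here is a rough sketch of such a single-process
loop (hypothetical code, not anything in Nutch; it assumes an embedded
database such as HSQLDB on the classpath):

  import java.sql.*;

  // Hypothetical single-process crawler: one table is both the
  // fetchlist and the record of what has been fetched.
  public class RdbmsCrawler {
    public static void main(String[] args) throws Exception {
      Class.forName("org.hsqldb.jdbcDriver");   // load the JDBC driver
      Connection db =
          DriverManager.getConnection("jdbc:hsqldb:mem:crawl", "sa", "");
      Statement s = db.createStatement();
      s.execute("CREATE TABLE urls (url VARCHAR(1024) PRIMARY KEY,"
          + " fetched INTEGER)");
      s.execute("INSERT INTO urls VALUES ('http://example.com/', 0)");

      PreparedStatement next =
          db.prepareStatement("SELECT url FROM urls WHERE fetched = 0");
      PreparedStatement done =
          db.prepareStatement("UPDATE urls SET fetched = 1 WHERE url = ?");

      ResultSet rs;
      // Online fetchlist building: each pass re-queries the database,
      // so outlinks inserted by a fetch are eligible immediately.
      while ((rs = next.executeQuery()).next()) {
        String url = rs.getString(1);
        // fetching and outlink INSERTs would go here
        done.setString(1, url);
        done.executeUpdate();   // update the db in real time, per URL
        rs.close();
      }
    }
  }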
It would be worth considering which features of your constrained crawler
could be cast as improvements to Nutch's existing tools (e.g., more
seed URL formats, more output formats, HTTP/1.1, custom scopes, etc.)
and which require a different control flow (online fetchlist building?).
In some cases (e.g., fetch prioritization) perhaps a new Plugin should
be added to Nutch.
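For the plugin route, the simplest extension point to picture is
URLFilter (a sketch only; a real plugin also needs a plugin.xml
descriptor, and the interface may differ slightly in the mapred branch):

  import org.apache.nutch.net.URLFilter;

  // Trivial custom-scope filter: returning the URL keeps it,
  // returning null drops it.
  public class ScopeURLFilter implements URLFilter {
    // Hypothetical hard-coded scope; a real filter would read its
    // scope from configuration.
    private static final String SCOPE = "http://example.com/";

    public String filter(String urlString) {
      return urlString.startsWith(SCOPE) ? urlString : null;
    }
  }

A fetch-prioritization plugin would need a richer extension point than
keep/drop, but the packaging would be much the same.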
In the mapred branch the webdb has been decomposed into a crawldb and a
linkdb. The crawldb is much smaller and simpler than the former webdb,
containing only an entry for each known URL. This makes updates much
faster while crawling. The linkdb contains only the link graph, and
needs to be updated only prior to indexing, not at each step while
crawling as before.
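Schematically, the split looks like this (field names are illustrative,
not the actual mapred-branch classes):

  // crawldb: one small record per known URL, rewritten cheaply
  // on each update while crawling.
  class CrawlDbEntry {
    String url;
    byte   status;          // e.g. unfetched, fetched, gone
    long   nextFetchTime;
  }

  // linkdb: the link graph only, rebuilt just before indexing.
  class LinkDbEntry {
    String   toUrl;
    String[] fromUrls;      // inlinks, e.g. for anchor text
  }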
Doug