Kelvin Tan wrote:
Seeing that mapred is about to be folded into trunk, three questions:

1. Any benchmarks/estimates on when the scalability of map-reduce surpasses its 
overhead/complexity? E.g., with more than 10 reduce workers?

I think that with as few as two boxes it will outperform uniprocessor Nutch. This will not be true for very small collections, where the overhead of starting JVMs can dominate the total run time. (MapReduce runs each task in a separate JVM, for robustness.)

2. Will there be an option of a plain vanilla single-box Nutch crawler vs a 
map-reduce version?

Yes, there already is. By default MapReduce runs on a single box in a single JVM. To run on multiple boxes in multiple JVMs, one must alter the default configuration (to name the jobtracker server) and start the jobtracker daemon plus one or more tasktracker daemons. Shell scripts are provided to help manage these daemons.
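Concretely, the switch is a configuration change plus daemon startup. The sketch below is illustrative only: mapred.job.tracker is the property the mapred branch consults, but the host:port value and the exact script names may differ in your checkout.

  <!-- in conf/nutch-site.xml: replace the default value "local"
       (single-JVM mode) with the jobtracker's host:port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>
  </property>

  # on the master:
  bin/nutch-daemon.sh start jobtracker
  # on each worker box:
  bin/nutch-daemon.sh start tasktracker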

3. What are the options for users who don't want to jump onboard map-red? Will 
pre-mapred be actively maintained?

The MapReduce versions of Nutch tools (inject, generate, fetch, etc.) are not a large amount of code. One could easily build compatible non-MapReduce versions of these tools. But then we'd be maintaining two versions, so we should avoid this where possible. However, if some applications need a very different control flow, that might justify a separate version. For example, one might write a crawler that combines a number of these tools in a single process, e.g., using an RDBMS to keep track of URLs and links while crawling, updating the database in real time as URLs are fetched. Such an architecture would not be as scalable, but might excel in other ways.
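To make that alternative control flow concrete, here is a minimal single-process sketch, not an existing Nutch tool: URLs are tracked in an RDBMS (an embedded HSQLDB here, but any JDBC database would do), the schema is hypothetical, and the fetcher is stubbed out.

  import java.sql.*;
  import java.util.Collections;
  import java.util.List;

  public class DbCrawler {
    // Stub: a real crawler would issue an HTTP request here and
    // return the outlinks parsed from the fetched page.
    static List<String> fetch(String url) {
      return Collections.emptyList();
    }

    public static void main(String[] args) throws Exception {
      Connection db =
        DriverManager.getConnection("jdbc:hsqldb:mem:crawl", "SA", "");
      Statement s = db.createStatement();
      s.execute("CREATE TABLE urls (url VARCHAR(1024) PRIMARY KEY," +
                " fetched BOOLEAN)");
      s.execute("INSERT INTO urls VALUES ('http://example.com/', FALSE)");

      PreparedStatement next = db.prepareStatement(
          "SELECT url FROM urls WHERE fetched = FALSE");
      PreparedStatement mark = db.prepareStatement(
          "UPDATE urls SET fetched = TRUE WHERE url = ?");
      PreparedStatement add = db.prepareStatement(
          "INSERT INTO urls VALUES (?, FALSE)");

      while (true) {
        ResultSet rs = next.executeQuery();
        if (!rs.next()) break;            // nothing left to fetch
        String url = rs.getString(1);
        for (String link : fetch(url)) {  // db updated in real time
          try {
            add.setString(1, link);
            add.executeUpdate();
          } catch (SQLException dup) {
            // link already known; ignore
          }
        }
        mark.setString(1, url);
        mark.executeUpdate();
      }
    }
  }

The point of the sketch is the control flow: discovered links enter the database immediately, rather than being batched into generate/fetch/update cycles.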

It would be worth considering which features of your constrained crawler could be cast as improvements to Nutch's existing tools (e.g., more seed URL formats, more output formats, HTTP 1.1 support, custom scopes, etc.) and which require a different control flow (online fetchlist building?). In some cases (e.g., fetch prioritization) perhaps a new plugin should be added to Nutch.
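For instance, a fetch-prioritization plugin might hang off a new extension point in the style of the existing URLFilter interface. The interface below is purely hypothetical, just to show the shape such a plugin could take:

  // Hypothetical extension point, modeled on Nutch's URLFilter;
  // no such interface exists in the mapred branch today.
  public interface FetchPrioritizer {
    /** Return a score for this URL; higher scores are fetched first. */
    float prioritize(String url, float currentScore);
  }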

In the mapred branch the webdb has been decomposed into a crawldb and a linkdb. The crawldb is much smaller and simpler than the former webdb, containing only one entry per known URL, which makes updates much faster while crawling. The linkdb contains only the link graph, and needs to be updated only prior to indexing, not at each step of the crawl as before.
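For reference, a crawldb entry is roughly the following record (cf. CrawlDatum in the mapred branch; the field names here are illustrative, not the exact API):

  // Approximate shape of a crawldb entry. The crawldb maps each known
  // URL to one such record; the linkdb maps each URL to its inlinks
  // and is only consulted at indexing time.
  public class CrawlEntry {
    byte  status;        // e.g. unfetched, fetched, gone
    long  fetchTime;     // when this URL is next due to be fetched
    byte  retries;       // failed fetch attempts so far
    float fetchInterval; // days between refetches
    float score;         // used to prioritize fetchlist generation
  }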

Doug
