Kelvin Tan wrote:
Seeing that mapred is about to be folded into trunk, three questions:
1. Any benchmarks/estimates on when the scalability of map-reduce surpasses its
overhead/complexity? E.g., with more than 10 reduce workers?
I think that with as few as two boxes it will outperform uniprocessor
Nutch. This will not be true for very small collections, where the
overhead of starting JVMs can dominate. (MapReduce runs each task in a
separate JVM, for robustness.)
2. Will there be an option of a plain vanilla single-box Nutch crawler vs a
map-reduce version?
Yes, there already is. By default, MapReduce runs on a single box in a
single JVM. To run on multiple boxes in multiple JVMs, one must alter
the default configuration (to name the jobtracker server) and start the
jobtracker daemon and one or more tasktracker daemons. There are shell
scripts to assist with managing these daemons.
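For example, something like this (a minimal sketch; mapred.job.tracker
and the daemon script are from the mapred branch, but the host and port
below are just placeholders):

  <!-- in nutch-site.xml: name the jobtracker instead of the
       default value of "local" -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8010</value>
  </property>

  # on the master, then on each slave:
  bin/nutch-daemon.sh start jobtracker
  bin/nutch-daemon.sh start tasktracker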
3. What are the options for users who don't want to jump on board with
map-red? Will pre-mapred be actively maintained?
The MapReduce versions of Nutch tools (inject, generate, fetch, etc.) are
not a large amount of code. One could easily build compatible
non-MapReduce versions of these tools. But then we'd be maintaining two
versions, so we should avoid this as much as possible. However, if some
applications need a very different control flow, that might warrant it.
For example, one might write a crawler that combines a number of these
tools in a single process, e.g., using an RDBMS to keep track of URLs
and links while crawling, updating the database in real time as URLs are
fetched. Such an architecture would not be as scalable, but might excel
in other ways.
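To make that concrete, here is a rough sketch of such a single-process
loop (hypothetical code, not anything in Nutch; it assumes an embedded
database such as HSQLDB on the classpath):

  import java.sql.*;

  // Hypothetical single-process crawler: one table is both the
  // fetchlist and the record of what has been fetched.
  public class RdbmsCrawler {
    public static void main(String[] args) throws Exception {
      Class.forName("org.hsqldb.jdbcDriver");   // load the JDBC driver
      Connection db =
          DriverManager.getConnection("jdbc:hsqldb:mem:crawl", "sa", "");
      Statement s = db.createStatement();
      s.execute("CREATE TABLE urls (url VARCHAR(1024) PRIMARY KEY,"
          + " fetched INTEGER)");
      s.execute("INSERT INTO urls VALUES ('http://example.com/', 0)");

      PreparedStatement next =
          db.prepareStatement("SELECT url FROM urls WHERE fetched = 0");
      PreparedStatement done =
          db.prepareStatement("UPDATE urls SET fetched = 1 WHERE url = ?");

      ResultSet rs;
      // Online fetchlist building: each pass re-queries the database,
      // so outlinks inserted by a fetch are eligible immediately.
      while ((rs = next.executeQuery()).next()) {
        String url = rs.getString(1);
        // fetching and outlink INSERTs would go here
        done.setString(1, url);
        done.executeUpdate();   // update the db in real time, per URL
        rs.close();
      }
    }
  }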
It would be worth considering which features of your constrained crawler
could be cast as improvements to Nutch's existing tools (e.g., more
seed URL formats, more output formats, HTTP/1.1, custom scopes, etc.)
and which require a different control flow (online fetchlist building?).
In some cases (e.g., fetch prioritization) perhaps a new Plugin should
be added to Nutch.
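For the plugin route, the simplest extension point to picture is
URLFilter (a sketch only; a real plugin also needs a plugin.xml
descriptor, and the interface may differ slightly in the mapred branch):

  import org.apache.nutch.net.URLFilter;

  // Trivial custom-scope filter: returning the URL keeps it,
  // returning null drops it.
  public class ScopeURLFilter implements URLFilter {
    // Hypothetical hard-coded scope; a real filter would read its
    // scope from configuration.
    private static final String SCOPE = "http://example.com/";

    public String filter(String urlString) {
      return urlString.startsWith(SCOPE) ? urlString : null;
    }
  }

A fetch-prioritization plugin would need a richer extension point than
keep/drop, but the packaging would be much the same.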
In the mapred branch the webdb has been decomposed into a crawldb and a
linkdb. The crawldb is much smaller and simpler than the former webdb,
containing only an entry for each known URL. This makes updates much
faster while crawling. The linkdb contains only the link graph, and
needs to be updated only prior to indexing, not at each step while
crawling as before.
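Schematically, the split looks like this (field names are illustrative,
not the actual mapred-branch classes):

  // crawldb: one small record per known URL, rewritten cheaply
  // on each update while crawling.
  class CrawlDbEntry {
    String url;
    byte   status;          // e.g. unfetched, fetched, gone
    long   nextFetchTime;
  }

  // linkdb: the link graph only, rebuilt just before indexing.
  class LinkDbEntry {
    String   toUrl;
    String[] fromUrls;      // inlinks, e.g. for anchor text
  }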
Doug