Re: Appropriate MapReduce Hardware

Doug Cutting Mon, 09 Jan 2006 11:38:47 -0800

Chris Schneider wrote:

2) The TaskTracker nodes should probably also be DataNodes in such a relatively small system. No significant data is saved on the TaskTracker machine, except in its role as a DataNode.

It is actually optimal to run both a TaskTracker and a DataNode on all slave boxes. That way map tasks can be assigned to nodes where their input data is local, and reduce tasks can write the first copy of their output locally, reducing network I/O. (These optimizations are not in the current code, but will be soon.)
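
To illustrate the locality idea, here is a minimal sketch of a greedy assignment that prefers a node already holding a map task's input block. It is not the actual JobTracker code; the function name, the data structures, and the node names are all hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Greedy, locality-first assignment (illustrative sketch only).

    block_locations: dict mapping task id -> list of nodes holding its input block
    free_slots:      dict mapping node -> number of free task slots
    Returns a dict mapping task id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    remote = []

    # First pass: place each task on a node that already stores its input.
    for task, nodes in block_locations.items():
        local = next((n for n in nodes if slots.get(n, 0) > 0), None)
        if local is not None:
            assignment[task] = (local, True)
            slots[local] -= 1
        else:
            remote.append(task)

    # Second pass: remaining tasks go to any node with a free slot,
    # which means their input must cross the network.
    for task in remote:
        node = next((n for n, s in slots.items() if s > 0), None)
        if node is None:
            break  # no capacity left; later tasks stay unassigned
        assignment[task] = (node, False)
        slots[node] -= 1

    return assignment
```

Because the first pass is greedy, it can miss a better overall placement; a real scheduler would likely be smarter about that.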

3) The NameNode box probably wants to keep large indexes of blocks in memory, but I wouldn't expect these to exceed the same 2GB metric we're using for the TaskTrackers. Likewise, I wouldn't expect the CPU speed to be a major constraint (mostly network bound). Finally, I can't imagine why the NameNode would need tons of disk space.
4) I would imagine that the JobTracker would have even less need for big RAM and a fast CPU, let alone hard drive space. I'd probably start with this running on the same box as the NameNode.

I typically run the NameNode and JobTracker on the same box, the master. Ideally this box might be configured differently (e.g., a RAID for higher disk reliability), but practically speaking it's fine and simpler to have it configured the same as the others. I usually run a cron entry on the NameNode box which periodically copies the NDFS name data to another drive or machine with rsync, since the NameNode is a single point of failure.
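
A minimal sketch of such a backup step, written as a small script that cron could invoke. The paths and host name are placeholders rather than real NDFS defaults, and the only rsync option used is `-a` (archive mode).

```python
import subprocess


def build_rsync_command(name_dir, backup_target):
    """Build the rsync argument list for copying the name directory.

    The trailing slash on the source tells rsync to copy the directory's
    contents rather than the directory itself.
    """
    source = name_dir.rstrip("/") + "/"
    return ["rsync", "-a", source, backup_target]


def backup_name_data(name_dir, backup_target):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_rsync_command(name_dir, backup_target), check=True)


if __name__ == "__main__":
    # Hypothetical locations; substitute the real name directory and target.
    backup_name_data("/data/ndfs/name", "backup-host:/backup/ndfs/name/")
```

How often cron runs this determines how much name data could be lost if the master's disk fails between copies.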

7) Since the local network will probably be the gating performance parameter, we'll need a 1GB network.

Yes, I've benchmarked 30 & 180 node NDFS systems with 100MB networking, and the network does appear to be the bottleneck.
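
For a rough sense of scale, here is a back-of-the-envelope calculation. It assumes "100MB" means a 100 Mbit/s link and "1GB" means a 1 Gbit/s link, and it ignores protocol overhead and contention; the 1 GB payload is an arbitrary example, not a benchmark figure.

```python
def transfer_seconds(num_bytes, link_mbit_per_s):
    """Ideal time to move num_bytes over a link of the given speed.

    Assumes the link is fully used and ignores protocol overhead.
    """
    bits = num_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


one_gigabyte = 1_000_000_000  # decimal GB, chosen only for illustration

print(transfer_seconds(one_gigabyte, 100))   # 80.0 seconds at 100 Mbit/s
print(transfer_seconds(one_gigabyte, 1000))  # 8.0 seconds at 1 Gbit/s
```

Under those assumptions a single node can move at most about 12.5 MB/s over the slower link, which is probably less than its local disks can read sequentially.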


Doug
