Chris Schneider wrote:
2) The TaskTracker nodes should probably also be DataNodes in such a relatively small system. No significant data is saved on the TaskTracker machine, except in its role as a DataNode.

It is actually optimal to run both a TaskTracker and a DataNode on every slave box. That way map tasks can be assigned to nodes where their input data is local, and reduce tasks can write the first copy of their output locally, reducing network I/O. (These optimizations are not in the current code, but will be soon.)
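
To illustrate the locality preference described above, here is a minimal sketch in Python. It is not the actual JobTracker code (as noted, the optimization is not implemented yet), and the function names, task ids, and host names are all hypothetical. It only shows the idea: when a node asks for work, prefer a task whose input block is stored on that node, and fall back to a remote read otherwise.

```python
from typing import Dict, List, Optional, Set


def pick_task_for_node(node: str, pending: Dict[str, Set[str]]) -> Optional[str]:
    """Return a pending task id for `node`, preferring one whose input is local.

    `pending` maps a task id to the set of hosts that hold a copy of its
    input block (hypothetical structure, for illustration only).
    """
    for task_id, hosts in pending.items():
        if node in hosts:
            return task_id  # input block is on this node: no network read
    # No local task available: hand out any pending task (remote read).
    return next(iter(pending), None)


def assign(nodes: List[str], pending: Dict[str, Set[str]]) -> Dict[str, str]:
    """Greedily give each free node one task, local-first."""
    remaining = dict(pending)  # copy so the caller's dict is left untouched
    assignment: Dict[str, str] = {}
    for node in nodes:
        task_id = pick_task_for_node(node, remaining)
        if task_id is None:
            break  # nothing left to schedule
        assignment[node] = task_id
        del remaining[task_id]
    return assignment
```

With a TaskTracker on every DataNode, most requests can be satisfied by the first branch, which is where the network savings would come from.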

3) The NameNode box probably wants to keep large indexes of blocks in memory, but I wouldn't expect these to exceed the same 2GB metric we're using for the TaskTrackers. Likewise, I wouldn't expect the CPU speed to be a major constraint (mostly network bound). Finally, I can't imagine why the NameNode would need tons of disk space.

4) I would imagine that the JobTracker would have even less need for big RAM and a fast CPU, let alone hard drive space. I'd probably start with this running on the same box as the NameNode.

I typically run the NameNode and JobTracker on the same box, the master. Ideally this box might be configured differently (e.g., a RAID for higher disk reliability), but practically speaking it's fine and simpler to have it configured the same as the others. Since the NameNode is a single point of failure, I usually run a cron entry on that box which periodically copies the NDFS name data to another drive or machine with rsync.
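
A rough sketch of what such a periodic copy could look like, written as a small Python wrapper that cron would invoke. The function names are made up for this example, and the source and destination are whatever you pass in; this is not my actual cron entry. The only rsync option used is -a (archive mode).

```python
import subprocess
from typing import List


def build_rsync_command(name_dir: str, dest: str) -> List[str]:
    """Build the rsync argument list for copying the name data.

    The trailing slash on the source makes rsync copy the directory's
    contents rather than the directory itself.
    """
    source = name_dir.rstrip("/") + "/"
    # -a: archive mode (recursive, preserves permissions and timestamps)
    return ["rsync", "-a", source, dest]


def backup_name_data(name_dir: str, dest: str) -> None:
    """Run the copy; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(name_dir, dest), check=True)
```

The sketch deliberately leaves out --delete, so files that disappear from the name directory are not also removed from the copy.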

7) Since the local network will probably be the gating performance parameter, we'll need a 1GB network.

Yes, I've benchmarked 30- and 180-node NDFS systems with 100MB networking, and the network does appear to be the bottleneck.

Doug
