Chris Schneider wrote:
2) The TaskTracker nodes should probably also be DataNodes in such a
relatively small system. No significant data is saved on the TaskTracker
machine, except in its role as a DataNode.
It is actually optimal to run both a TaskTracker and a DataNode on
every slave box. That way map tasks can be assigned to nodes where
their input data is local, and reduce tasks can write the first copy of
their output locally, reducing network I/O. (These optimizations are
not in the current code, but will be soon.)
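
To illustrate the locality idea only (this is not the actual scheduler code, and the function, task, and node names below are all hypothetical), a minimal Python sketch of preferring a task whose input block already lives on the requesting node might look like this:

```python
def pick_task(node, pending_tasks, block_locations):
    """Return a pending task whose input block is stored on `node`, if any.

    Falls back to the first pending task when nothing is local, and
    returns None when there is no pending work at all.
    """
    for task in pending_tasks:
        if node in block_locations.get(task, ()):
            return task
    return pending_tasks[0] if pending_tasks else None


# Hypothetical layout: task name -> set of nodes holding its input block.
block_locations = {
    "map-0": {"slave1", "slave2"},
    "map-1": {"slave3"},
}
pending = ["map-0", "map-1"]

# slave3 holds the input for map-1, so it gets that task.
local_choice = pick_task("slave3", pending, block_locations)
# slave9 holds no input blocks, so it falls back to the first pending task.
fallback_choice = pick_task("slave9", pending, block_locations)
```

A node that is only a TaskTracker and not a DataNode would always take the fallback path here, which is why co-locating the two matters.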
3) The NameNode box probably wants to keep large indexes of blocks in
memory, but I wouldn't expect these to exceed the same 2GB metric we're
using for the TaskTrackers. Likewise, I wouldn't expect the CPU speed to
be a major constraint (mostly network bound). Finally, I can't imagine
why the NameNode would need tons of disk space.
4) I would imagine that the JobTracker would have even less need for big
RAM and a fast CPU, let alone hard drive space. I'd probably start with
this running on the same box as the NameNode.
I typically run the NameNode and JobTracker on the same box, the master.
Ideally this box might be configured differently (e.g., a RAID for
higher disk reliability), but practically speaking it's fine and simpler
to have it configured the same as the others. I usually run a cron
entry on the NameNode box that periodically copies the NDFS name data to
another drive or machine with rsync, since the NameNode is a single
point of failure.
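
As a rough sketch of that kind of backup job (the directory, the backup host, and the function names are assumptions made up for illustration, not the actual setup), a small Python wrapper around rsync could be:

```python
import subprocess


def build_rsync_command(name_dir, backup_target):
    """Build an rsync command that copies the contents of name_dir.

    -a (archive mode) preserves permissions and timestamps. The trailing
    slash on the source tells rsync to copy the directory's contents
    rather than the directory itself.
    """
    return ["rsync", "-a", name_dir.rstrip("/") + "/", backup_target]


def backup_name_data(name_dir, backup_target):
    """Run the copy; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(name_dir, backup_target), check=True)


# Hypothetical paths; substitute wherever the name data actually lives.
cmd = build_rsync_command("/data/ndfs/name", "backuphost:/backup/ndfs/name/")
```

A script that calls backup_name_data() could then be scheduled from cron at whatever interval matches how much name data one can afford to lose.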
7) Since the local network will probably be the gating performance
parameter, we'll need a 1GB network.
Yes, I've benchmarked 30- and 180-node NDFS systems with 100MB networking,
and the network does appear to be the bottleneck.
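
For a back-of-the-envelope sense of the difference (assuming the two links are 100 megabit/s and 1 gigabit/s Ethernet, and ignoring protocol overhead and disk speed, so these are best-case figures rather than benchmark results):

```python
def transfer_seconds(num_bytes, link_bits_per_second):
    """Idealized time to push num_bytes over a link of the given speed."""
    return num_bytes * 8 / link_bits_per_second


ONE_GB = 10**9  # bytes (decimal gigabyte, chosen for round numbers)

# Moving 1 GB of data between two nodes:
fast_ethernet = transfer_seconds(ONE_GB, 100 * 10**6)  # 100 megabit/s link
gigabit = transfer_seconds(ONE_GB, 10**9)              # 1 gigabit/s link
```

Under those assumptions the same gigabyte takes about 80 seconds on the slower link versus about 8 seconds on the faster one.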
Doug