On 07.01.2006 at 00:37, Chris Schneider wrote:

Gang,

We'd like to take a crack at moving to MapReduce and are evaluating various options for building a relatively small rack of (5-10?) machines to play around with. I'm hoping that somebody out there would be willing to make a quick pass at the minimum (or appropriate?) hardware requirements for MapReduce. The following are my working assumptions based on a quick look through the online documentation, but please feel free to point out all glaring gaps in my understanding:

1) The whole point of MapReduce is to spread the workload across many relatively low-end machines. Thus, I'm guessing that the TaskTracker machines would have roughly the same RAM and CPU requirements as Nutch 0.7 (2GB RAM, 1GHz CPU, perhaps even less?)

That's about right. A fast network is important.
2) The TaskTracker nodes should probably also be DataNodes in such a relatively small system. No significant data is saved on the TaskTracker machine, except in its role as a DataNode.
Right.

3) The NameNode box probably wants to keep large indexes of blocks in memory, but I wouldn't expect these to exceed the same 2GB metric we're using for the TaskTrackers. Likewise, I wouldn't expect the CPU speed to be a major constraint (mostly network bound). Finally, I can't imagine why the NameNode would need tons of disk space.
Right; for the NameNode, more RAM matters more than disk. But having 10 identical boxes may be cheaper than having individual configurations.

4) I would imagine that the JobTracker would have even less need for big RAM and a fast CPU, let alone hard drive space. I'd probably start with this running on the same box as the NameNode.
Right, but the JobTracker needs more resources as you add more TaskTrackers, so don't give it too little.

5) I would imagine that you'd want to scale the combined disk capacity of all the DataNodes in the rack to 3x what you'd need with Nutch 0.7, since NDFS distributes multiple copies of each data block across DataNodes. I guess this means that we'd need (10K/page)*(100M pages)*3 = 3TB total. I guess we'd want a total of 5TB of disk space in the rack to be safe.
You can configure how many block copies exist in NDFS, though I don't remember off-hand where you set that.
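As a quick sanity check on the arithmetic in point 5, here is a small sketch. The page size, page count, and replication factor are just the assumptions from the question, not measured values, and the `total_raw_capacity_tb` helper is purely illustrative (in later Hadoop releases the replication factor is the `dfs.replication` setting; the property name in the NDFS of this era may differ):

```python
# Rough disk-capacity estimate for the rack, using the working
# assumptions from point 5: ~10 KB per page, 100M pages, and 3x
# block replication. All figures are assumptions, not benchmarks.

def total_raw_capacity_tb(bytes_per_page, num_pages, replication, headroom=1.0):
    """Raw disk needed across all DataNodes, in decimal terabytes."""
    raw_bytes = bytes_per_page * num_pages * replication * headroom
    return raw_bytes / 1e12  # decimal TB, matching the back-of-envelope math

# 10 KB/page * 100M pages * 3 copies = 3 TB
print(total_raw_capacity_tb(10_000, 100_000_000, 3))  # -> 3.0

# With ~66% extra headroom you arrive at roughly the 5 TB figure above.
print(total_raw_capacity_tb(10_000, 100_000_000, 3, headroom=5 / 3))
```

Spread across 5-10 DataNodes, that works out to roughly 0.5-1 TB of disk per box.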

6) Even if the NDFS is able to keep most of each TaskTracker's writes local (i.e., to the DataNode running on the same box), you've still got the overhead of block replication, which is surely network-bound. Thus, I'm guessing that slower SATA hard drives would be more than sufficient.

SATA is fair, network speed is important.
7) Since the local network will probably be the gating performance parameter, we'll need a gigabit (1Gb/s) network.
Right... or faster. :-)

Although the empirical data may be scarce, any insight into MapReduce hardware requirements would be quite helpful as we evaluate this rack investment.
Your ideas sound good.

Stefan
