On 07.01.2006 at 00:37, Chris Schneider wrote:
Gang,
We'd like to take a crack at moving to MapReduce and are evaluating
various options for building a relatively small rack of (5-10?)
machines to play around with. I'm hoping that somebody out there
would be willing to make a quick pass at the minimum (or
appropriate?) hardware requirements for MapReduce. The following
are my working assumptions based on a quick look through the online
documentation, but please feel free to point out all glaring gaps
in my understanding:
1) The whole point of MapReduce is to spread the workload across
many relatively low-end machines. Thus, I'm guessing that the
TaskTracker machines would have roughly the same RAM and CPU
requirements as Nutch 0.7 (2GB RAM, 1GHz CPU, perhaps even less?)
That's OK. A fast network is important.
2) The TaskTracker nodes should probably also be DataNodes in such
a relatively small system. No significant data is saved on the
TaskTracker machine, except in its role as a DataNode.
Right.
3) The NameNode box probably wants to keep large indexes of blocks
in memory, but I wouldn't expect these to exceed the same 2GB
metric we're using for the TaskTrackers. Likewise, I wouldn't
expect the CPU speed to be a major constraint (mostly network
bound). Finally, I can't imagine why the NameNode would need tons
of disk space.
Right, more RAM is better than more disk for the NameNode. But buying 10
identical boxes may be cheaper than maintaining individual configurations.
4) I would imagine that the JobTracker would have even less need
for big RAM and a fast CPU, let alone hard drive space. I'd
probably start with this running on the same box as the NameNode.
Right, but the JobTracker needs more resources as you add more
TaskTrackers, so don't give it too little.
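For reference, here is a minimal sketch of what co-locating the NameNode and
JobTracker on one box might look like in conf/nutch-site.xml. The property
names (fs.default.name, mapred.job.tracker) and the host/port values are my
assumptions; check nutch-default.xml for the authoritative names and defaults.

  <!-- Point both the NDFS clients and the MapReduce clients at the same
       machine, so the NameNode and JobTracker share one box. Host and
       ports below are hypothetical placeholders. -->
  <property>
    <name>fs.default.name</name>
    <value>master-box:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master-box:9001</value>
  </property>

Every TaskTracker and DataNode would then read the same two properties to find
the master box, so the whole rack can share one config file.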
5) I would imagine that you'd want to scale the combined disk
capacity of all the DataNodes in the rack to 3x what you'd need
with Nutch 0.7, since NDFS distributes multiple copies
of the data blocks across DataNodes. I guess this means that we'd
need (10K/page)*(100M pages)*3=3TB total. I guess we'd want a total
of 5TB of disk space in the rack to be safe.
You can configure how many copies of each block exist in NDFS, but don't
ask me where that setting is.
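My best guess at where that knob lives, as a sketch for conf/nutch-site.xml:
the property name (something like ndfs.replication) and its default are
assumptions on my part, so check nutch-default.xml for the exact spelling.

  <!-- Hypothetical: number of copies NDFS keeps of each block. The default
       is assumed to be 3, which is where the 3x disk estimate above
       comes from. -->
  <property>
    <name>ndfs.replication</name>
    <value>3</value>
  </property>

Dropping it to 2 on a small test rack would cut the estimate above from 3TB
to about 2TB of raw disk, at the cost of less redundancy.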
6) Even if the NDFS is able to keep most of each TaskTracker's
writes local (i.e., to the DataNode running on the same box),
you've still got the overhead of block replication, which is surely
network-bound. Thus, I'm guessing that slower SATA hard drives
would be more than sufficient.
SATA is fine; network speed is what matters.
7) Since the local network will probably be the gating performance
parameter, we'll need a gigabit (1 Gb/s) network.
right...or faster. :-)
Although the empirical data may be scarce, any insight into
MapReduce hardware requirements would be quite helpful as we
evaluate this rack investment.
Your ideas sound good.
Stefan