Hi,

I wonder if anyone is aware of any measurements of Hadoop scalability as a
function of cluster size, e.g., read/write/append throughput on clusters of
5, 10, 30, 50, or 100 nodes, or something similar.

These numbers would help us plan the resources for a small academic cluster.
We are concerned that Hadoop may not work properly if there are too few nodes
(too many map/reduce tasks per node or too many replicas per node, for
example). We currently have 4 nodes (16 GB of RAM, 6 x 750 GB disks, and a
quad-core AMD Opteron processor each). Our initial plan is to perform a Web
crawl for academic purposes (somewhere between 500 million and 1 billion
pages), and we will need to expand the number of nodes for that. In terms of
Hadoop performance, is it better to have a larger number of nodes that are
simpler than the ones we currently have (less memory, less processing power)?
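
To make the concern concrete, this is roughly the kind of per-node tuning we
have in mind (just a sketch, assuming the classic property names
dfs.replication, mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum; the values are placeholders for our
current 4-node setup, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    public class SmallClusterConf {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Keep replication below the node count so a small cluster is not
            // forced to place several replicas on the same node.
            conf.set("dfs.replication", "2");
            // Limit concurrent map/reduce tasks per node (quad-core, 16 GB RAM).
            conf.set("mapred.tasktracker.map.tasks.maximum", "4");
            conf.set("mapred.tasktracker.reduce.tasks.maximum", "2");
            return conf;
        }
    }

We are unsure how settings like these should change as the cluster grows, which
is why measured numbers at different cluster sizes would be very helpful.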

Thank you in advance for any information!

Guilherme Menezes
