Hi, I wonder if anyone is aware of any measurements of Hadoop scalability with cluster size, e.g., read/write/append throughput on clusters of 5, 10, 30, 50, or 100 nodes, or something similar.
These numbers would help us plan the resources for a small academic cluster. We are concerned that Hadoop may not work properly with too few nodes (for example, too many map/reduce tasks per node or too many replicas per node). We currently have 4 nodes (16 GB of RAM, 6 x 750 GB disks, Quad-Core AMD Opteron processors).

Our initial plan is to perform a Web crawl for academic purposes (somewhere between 500 million and 1 billion pages), and we will need to expand the number of nodes for that. In terms of Hadoop performance, is it better to have a larger number of simpler nodes than the ones we currently have (less memory, less processing power)?

Thank you in advance for any information!

Guilherme Menezes
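P.S. For concreteness, the settings we are worried about are, as far as I understand, the block replication factor and the per-node task limits. A minimal sketch with the stock hdfs-site.xml / mapred-site.xml property names (the values below are only illustrative, not our actual configuration):

    <!-- hdfs-site.xml: copies kept of each block; with 4-5 nodes and
         replication 3, almost every node ends up holding a replica
         of most blocks -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    <!-- mapred-site.xml: upper bound on concurrent tasks per TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>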
