What about just taking down half of the nodes and then loading your data into the remainder? It should take about 20 minutes each time you remove nodes, but only a few seconds each time you add some. Remember that you need to reload the data each time (or rebalance it if growing the cluster) to get realistic numbers.
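
If you go that route, it's worth confirming how many datanodes are actually up before each timing run, so each data point is labeled with the right cluster size. "hadoop dfsadmin -report" will tell you from the command line; below is a rough, untested sketch of doing the same check from Java. The package names shown are the current org.apache.hadoop.hdfs ones -- on a 0.16 cluster the equivalent classes live under org.apache.hadoop.dfs, if memory serves -- so treat the details as approximate.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch: ask the namenode which datanodes it currently knows about, so you
// can record how many nodes each timing run really had. The report covers
// every registered datanode; filter on its state if you only want live ones.
public class LiveNodeCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster's site config
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      System.err.println("default filesystem is not HDFS");
      return;
    }
    DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
    System.out.println("datanodes reported: " + nodes.length);
    for (DatanodeInfo node : nodes) {
      System.out.println("  " + node.getHostName());
    }
  }
}
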
My suggested procedure would be to take all but 2 nodes down, and then:

- run test
- double number of nodes
- rebalance file storage
- lather, rinse, repeat

On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Hi Hadoop mavens-
> I'm hoping someone out there will have a quick solution for me. I'm
> trying to run some very basic scaling experiments for a rapidly
> approaching paper deadline on a 0.16.0 Hadoop cluster that has ~20 nodes
> with 2 procs/node. Ideally, I would want to run my code on clusters
> of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
> The problem is that I am not able to reconfigure the cluster (in the
> long run, i.e., before a final version of the paper, I assume this
> will be possible, but for now it's not). Setting the number of
> mappers/reducers does not seem to be a viable option, at least not in
> the trivial way, since the physical layout of the input files makes
> Hadoop run a different number of tasks than I request (most of my
> jobs consist of multiple MR steps, the initial one always running on a
> relatively small data set, which fits into a single block, and
> therefore the Hadoop framework does honor my task number request on
> the first job -- but during the later ones it does not).
>
> My questions:
> 1) Can I get around this limitation programmatically? I.e., is there
> a way to tell the framework to only use a subset of the nodes for DFS
> / mapping / reducing?
> 2) If not, what statistics would be good to report if I can only have
> two data points -- a legacy "single-core" implementation of the
> algorithms and a MapReduce version running on a full cluster?
>
> Thanks for any suggestions!
> Chris
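
Re: question 1 -- as far as I know there is no per-job switch that restricts which nodes DFS or the tasktrackers will use, which is why I'd physically shrink the cluster as described above. What you can control from the job itself is the reduce count, which is honored exactly; the map count is only a hint that the InputFormat reconciles with the block layout of the input, which is exactly why your later, multi-block jobs ignore it. Here is a rough sketch against the old mapred API (identity mapper/reducer as stand-ins for your own classes; the static path setters may live elsewhere in 0.16, so treat the details as approximate):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Sketch: a pass-through job that pins the reduce count and *requests* a map
// count. args[0] = input dir, args[1] = output dir.
public class TaskCountDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TaskCountDemo.class);
    conf.setJobName("task-count-demo");

    conf.setMapperClass(IdentityMapper.class);    // swap in your real mapper
    conf.setReducerClass(IdentityReducer.class);  // swap in your real reducer
    conf.setOutputKeyClass(LongWritable.class);   // TextInputFormat's key type
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setNumReduceTasks(8);  // honored exactly

    // Only a hint: with the stock FileInputFormat you still get roughly one
    // map per input block at minimum, so a multi-block input runs more maps
    // than this regardless of what you request.
    conf.setNumMapTasks(8);

    JobClient.runJob(conf);
  }
}

As far as I can tell, the hint can split a single small block into more maps (which is why your first job behaves) but never merges blocks into fewer maps, so it won't help you emulate a smaller cluster.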
