Thanks -- that should work. I'll follow up with the cluster administrators to see if I can get this to happen. To rebalance the file storage, can I just set the replication factor using "hadoop dfs"?

Chris
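P.S. From the 0.16 shell docs, I'm guessing the rebalance step would look roughly like the following -- treat it as an untested sketch with a made-up path, and I'm not sure the balancer made it into our build:

    # raise the replication factor on the input data and wait (-w) for
    # the namenode to finish re-replicating the blocks
    hadoop dfs -setrep -w 3 /user/chris/input
    # then run the balancer to even out block placement across datanodes
    hadoop balancer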
On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> What about just taking down half of the nodes and then loading your data
> into the remainder? Should take about 20 minutes each time you remove nodes
> but only a few seconds each time you add some. Remember that you need to
> reload the data each time (or rebalance it if growing the cluster) to get
> realistic numbers.
>
> My suggested procedure would be to take all but 2 nodes down, and then
>
> - run test
> - double the number of nodes
> - rebalance file storage
> - lather, rinse, repeat
>
>
> On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:
>
> > Hi Hadoop mavens-
> > I'm hoping someone out there will have a quick solution for me. I'm
> > trying to run some very basic scaling experiments for a rapidly
> > approaching paper deadline on a Hadoop 0.16 cluster that has ~20 nodes
> > with 2 procs/node. Ideally, I would want to run my code on clusters of
> > different sizes (1, 2, 4, 8, 16 nodes), or some such thing.
> > The problem is that I am not able to reconfigure the cluster (in the
> > long run, i.e., before a final version of the paper, I assume this
> > will be possible, but for now it's not). Setting the number of
> > mappers/reducers does not seem to be a viable option, at least not in
> > the trivial way, since the physical layout of the input files makes
> > Hadoop run a different number of tasks than I request (most of my
> > jobs consist of multiple MR steps; the initial one always runs on a
> > relatively small data set, which fits into a single block, and
> > therefore the Hadoop framework does honor my task number request on
> > the first job -- but during the later ones it does not).
> >
> > My questions:
> > 1) Can I get around this limitation programmatically? I.e., is there
> > a way to tell the framework to use only a subset of the nodes for DFS
> > / mapping / reducing?
> > 2) If not, what statistics would be good to report if I can only have
> > two data points -- a legacy "single-core" implementation of the
> > algorithms and a MapReduce version running on a full cluster?
> >
> > Thanks for any suggestions!
> > Chris
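P.P.S. In case it helps to see the task-count issue above concretely, this is roughly the shape of the driver I'm running for each MR step (identity map/reduce and made-up paths, just for the example; 0.16-era JobConf API):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ScalingTest {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ScalingTest.class);
        conf.setJobName("scaling-test");
        // Advisory only: the actual number of map tasks follows the
        // number of input splits (i.e., the block layout of the input),
        // which is why the later, larger jobs ignore this hint.
        conf.setNumMapTasks(8);
        // The reduce count, by contrast, is honored exactly.
        conf.setNumReduceTasks(8);
        conf.setInputPath(new Path("/user/chris/step1-out"));   // made-up path
        conf.setOutputPath(new Path("/user/chris/step2-out"));  // made-up path
        // Defaults to the identity mapper and reducer, which is enough
        // to show the scheduling behavior.
        JobClient.runJob(conf);
      }
    }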
