Yes. Increase the replication. Wait. Drop the replication.
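Concretely, that would be something like the following (an untested sketch; the path and replication numbers are just placeholders, and 3 is assumed to be your normal replication factor):

    hadoop dfs -setrep -R 6 /user/chris/data   # bump replication so extra block copies spread across the nodes
    hadoop dfs -setrep -R 3 /user/chris/data   # once that has propagated, drop back to the normal factor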
On 3/12/08 3:44 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Thanks -- that should work. I'll follow up with the cluster
> administrators to see if I can get this to happen. To rebalance the
> file storage, can I just set the replication factor using "hadoop dfs"?
> Chris
>
> On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> What about just taking down half of the nodes and then loading your data
>> into the remainder? It should take about 20 minutes each time you remove
>> nodes but only a few seconds each time you add some. Remember that you need
>> to reload the data each time (or rebalance it if growing the cluster) to get
>> realistic numbers.
>>
>> My suggested procedure would be to take all but 2 nodes down, and then:
>>
>> - run the test
>> - double the number of nodes
>> - rebalance the file storage
>> - lather, rinse, repeat
>>
>> On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Hadoop mavens-
>>> I'm hoping someone out there will have a quick solution for me. I'm
>>> trying to run some very basic scaling experiments for a rapidly
>>> approaching paper deadline on a Hadoop 16.0 cluster that has ~20 nodes
>>> with 2 procs/node. Ideally, I would run my code on clusters with
>>> different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
>>> The problem is that I am not able to reconfigure the cluster (in the
>>> long run, i.e., before the final version of the paper, I assume this
>>> will be possible, but for now it's not). Setting the number of
>>> mappers/reducers does not seem to be a viable option, at least not in
>>> the trivial way, since the physical layout of the input files makes
>>> Hadoop run a different number of tasks than I request (most of my
>>> jobs consist of multiple MR steps; the initial one always runs on a
>>> relatively small data set, which fits into a single block, so the
>>> Hadoop framework does honor my task number request on the first job --
>>> but during the later ones it does not).
>>>
>>> My questions:
>>> 1) Can I get around this limitation programmatically? I.e., is there
>>> a way to tell the framework to use only a subset of the nodes for DFS
>>> / mapping / reducing?
>>> 2) If not, what statistics would be good to report if I can only have
>>> two data points -- a legacy "single-core" implementation of the
>>> algorithms and a MapReduce version running on the full cluster?
>>>
>>> Thanks for any suggestions!
>>> Chris
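For what it's worth, the doubling procedure described above might look roughly like this from the shell (a rough sketch only; the slaves.N files, node counts, and data paths are hypothetical things you would set up yourself, and the script names assume a stock Hadoop install):

    bin/stop-all.sh                                  # take DFS and MapReduce down
    cp conf/slaves.2 conf/slaves                     # slaves file listing only 2 nodes
    bin/start-all.sh                                 # bring the small cluster back up
    hadoop dfs -put /local/input /user/chris/input   # reload the input data onto it
    # run the test, record timings, then repeat with slaves.4, slaves.8, slaves.16,
    # reloading (or rebalancing) the data each time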
