Yes.

Increase the replication factor, wait for the extra copies to finish, then drop the replication back down.
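
Something along these lines should do it (the path and the replication
factors here are just placeholders; substitute your own):

  hadoop dfs -setrep -R 6 /user/chris/data
  hadoop fsck /user/chris/data     # repeat until no under-replicated blocks
  hadoop dfs -setrep -R 3 /user/chris/data

The first command forces blocks onto more machines, fsck tells you when the
extra copies have landed, and the last command trims the replicas back to
the normal factor.  Pick the high factor to match however many nodes are up
at that step.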


On 3/12/08 3:44 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Thanks-- that should work.  I'll follow up with the cluster
> administrators to see if I can get this to happen.  To rebalance the
> file storage, can I just set the replication factor using "hadoop dfs"?
> Chris
> 
> On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>>  What about just taking down half of the nodes and then loading your data
>>  into the remainder?  Should take about 20 minutes each time you remove nodes
>>  but only a few seconds each time you add some.  Remember that you need to
>>  reload the data each time (or rebalance it if growing the cluster) to get
>>  realistic numbers.
>> 
>>  My suggested procedure would be to take all but 2 nodes down, and then
>> 
>>  - run test
>>  - double number of nodes
>>  - rebalance file storage
>>  - lather, rinse, repeat
>> 
>> 
>> 
>> 
>>  On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi Hadoop mavens-
>>> I'm hoping someone out there will have a quick solution for me.  I'm
>>> trying to run some very basic scaling experiments for a rapidly
>>> approaching paper deadline on a Hadoop 0.16.0 cluster that has ~20 nodes
>>> with 2 procs/node.  Ideally, I would want to run my code on clusters
>>> of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
>>> The problem is that I am not able to reconfigure the cluster (in the
>>> long run, i.e., before a final version of the paper, I assume this
>>> will be possible, but for now it's not).  Setting the number of
>>> mappers/reducers does not seem to be a viable option, at least not in
>>> the trivial way, since the physical layout of the input files makes
>>> Hadoop run a different number of tasks/processes than I request (most of my
>>> jobs consist of multiple MR steps, the initial one always running on a
>>> relatively small data set, which fits into a single block, and
>>> therefore the Hadoop framework does honor my task number request on
>>> the first job-- but during the later ones it does not).
>>> 
>>> My questions:
>>> 1) can I get around this limitation programmatically?  I.e., is there
>>> a way to tell the framework to only use a subset of the nodes for DFS
>>> / mapping / reducing?
>>> 2) if not, what statistics would be good to report if I can only have
>>> two data points -- a legacy "single-core" implementation of the
>>> algorithms and a MapReduce version running on the full cluster?
>>> 
>>> Thanks for any suggestions!
>>> Chris
>> 
>> 
