When you run terasort, pass -Dmapred.reduce.tasks=4 and see how that
goes for you. See this old thread for info:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3ccbbf4b570906300617ma4505f5o2aa1b9fb87b31...@mail.gmail.com%3E
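
For example, with the same examples jar you've been using (the output
path here is just an illustration; terasort wants an output dir that
doesn't exist yet):

  hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort \
      -Dmapred.reduce.tasks=4 tera_in5 tera_out6

Note that the -D generic option has to come before the input and output
arguments for it to be picked up.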

-Todd
On Fri, Feb 25, 2011 at 4:45 PM, Jeffrey Buell <jbu...@vmware.com> wrote:
>
> Hi Todd,
>
> Thanks for the quick response.
>
> I added <final>true</final> to dfs.replication, but I still get just one
> copy of the output.  Can Hadoop applications override the replication
> level even when the parameter is marked final?
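>
> As a workaround, I suppose I can bump the replication of the output
> after the fact with something like this (path taken from my du output
> below):
>
>   hadoop fs -setrep -w 2 /user/hadoop/tera_out5
>
> But I'd still like to understand why the final setting isn't respected.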
>
> I tried increasing mapred.tasktracker.reduce.tasks.maximum from 1 to 4,
> but that didn't make any difference: the output is still all on one node.
> I thought that parameter controls the number of reduce tasks per node, so
> 1 should be sufficient.  Is that logic correct?  The other reduce
> parameters are at their defaults.  Which parameters should I be trying?
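>
> For reference, the change I made in mapred-site.xml was just this
> (standard Hadoop property syntax):
>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>4</value>
>   </property>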
>
> There's no way the sort can run efficiently if the generated data is
> unbalanced.  At 5 GB there are plenty of blocks to make the distribution
> even.  How does HDFS decide how to spread the storage across the nodes?
> Note that the sizes of the 4 chunks (1.7, ..., 3.2 GB) seem to be
> repeatable, but they land on different nodes each time teragen is run.
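>
> In case it's useful, this is how I've been checking where the blocks of
> a file actually land:
>
>   hadoop fsck /user/hadoop/tera_in5 -files -blocks -locations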
>
> Jeff
>
> > -----Original Message-----
> > From: Todd Lipcon [mailto:t...@cloudera.com]
> > Sent: Friday, February 25, 2011 3:13 PM
> > To: hdfs-user@hadoop.apache.org
> > Subject: Re: balancing and replication in HDFS
> >
> > Hi Jeff,
> > The output of terasort has replication level 1 by default. This is so
> > it goes faster with the default settings and makes for more impressive
> > benchmark results :)
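> > You can confirm this with a plain listing; the second column of the
> > output is each file's replication factor:
> >
> >   hadoop fs -ls /user/hadoop/tera_out5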
> > The reason you see it all on one machine is probably that you're
> > running with one reducer. Try configuring your terasort to use more
> > reduce tasks and you should see the load (and space usage) even out.
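> > To watch the space usage even out across nodes, run:
> >
> >   hadoop dfsadmin -report
> >
> > which prints DFS Used for each datanode.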
> > -Todd
> >
> > On Fri, Feb 25, 2011 at 2:52 PM, Jeffrey Buell <jbu...@vmware.com>
> > wrote:
> > >
> > > I'm a newbie to Hadoop and HDFS.  I'm seeing odd behavior in HDFS
> > > that I hope somebody can clear up for me.  I'm running Hadoop version
> > > 0.20.1+169.127 from the Cloudera distro on 4 identical nodes, each
> > > with 4 CPUs and 100 GB of disk space.  Replication is set to 2.
> > >
> > > I run:
> > >
> > > hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar teragen 50000000 tera_in5
> > >
> > > This produces the expected 10 GB of data on disk (5 GB * 2 copies).
> > > But the data is spread very unevenly across the nodes, ranging from
> > > 1.7 to 3.2 GB per node.  Then I sort the data:
> > >
> > > hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort tera_in5 tera_out5
> > >
> > > It finishes successfully, and HDFS recognizes the right amount of data:
> > >
> > > $ hadoop fs -du /user/hadoop/
> > > Found 2 items
> > > 5000023410  hdfs://namd-1/user/hadoop/tera_in5
> > > 5000170993  hdfs://namd-1/user/hadoop/tera_out5
> > >
> > > However, all the new data is on one node (apparently randomly
> > > chosen), and the total disk usage is only 15 GB, which means the
> > > output data is not replicated.  For nearly all the elapsed time of
> > > the sort, the other 3 nodes are idle.  Some of the output data is in
> > > dfs/data/current, but a lot of it is in one of 64 new subdirectories
> > > (dfs/data/current/subdir0 through subdir63).
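> > > (I'm measuring per-node usage with du on each node's data directory,
> > > e.g. "du -sh /path/to/dfs/data", where the path is whatever
> > > dfs.data.dir points to on that node.)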
> > >
> > > Why is all this happening?  Am I missing some tunables that would
> > > make HDFS balance and replicate correctly?
> > >
> > > Thanks,
> > >
> > > Jeff
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera



--
Todd Lipcon
Software Engineer, Cloudera
