I'm glad it helped, Aniket. I would recommend that you start your performance work by looking at your network infrastructure and the balance of data across your logical racks.

Cliff

On Fri, Sep 24, 2010 at 12:12 AM, aniket ray <[email protected]> wrote:

> Hi Cliff,
>
> Thanks, it did turn out to be speculative execution. When I turned it off,
> no more tasks were killed and the performance degraded.
>
> So my initial assumptions were incorrect after all. I guess I'll have to
> look at other ways to improve performance.
>
> Thanks for the help.
> -aniket
>
> On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer <[email protected]> wrote:
>
> > Aniket, I wonder if these tasks were run as Speculative Execution. Have
> > you been able to determine whether the job runs successfully?
> > HTH
> > Cliff
> >
> > On Thu, Sep 23, 2010 at 12:52 AM, aniket ray <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I continuously run a series of batch jobs using Hadoop Map Reduce. I
> > > also have a managing daemon that moves data around on the hdfs, making
> > > way for more jobs to be run.
> > > I use capacity scheduler to schedule many jobs in parallel.
> > >
> > > I see an issue on the Hadoop web monitoring UI at port 50030 which I
> > > believe may be causing a performance bottleneck, and I wanted to get
> > > more information.
> > >
> > > Approximately 10% of the reduce tasks show up as "Killed" in the UI.
> > > The logs say that the killed tasks are in the shuffle phase when they
> > > are killed, but the logs don't show any exception.
> > > My understanding is that these killed tasks would be started again and
> > > this slows down the whole hadoop job.
> > > I was wondering what the possible issues may be and how to debug this
> > > issue?
> > >
> > > I have tried on both the hadoop 0.20.2 and the latest version of hadoop
> > > from yahoo's github.
> > > I've monitored the nodes and there is a lot of free disk space and
> > > memory on all nodes (more than 1 TB free disk and 5 GB free memory at
> > > all times on all nodes).
> > >
> > > Since there are no exceptions or any other visible issues, I am finding
> > > it hard to figure out what the problem might be. Could anybody help?
> > >
> > > Thanks,
> > > -aniket
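
For anyone reproducing the diagnosis in this thread: the thread does not say how speculative execution was turned off. One common way on Hadoop 0.20.x is through the two job properties named in the sketch below. This is a minimal illustration, not the posters' actual setup; the helper name `speculative_config` is hypothetical, and the property names are the 0.20.x ones (later releases renamed them). Note that in the thread, disabling speculation removed the "Killed" tasks but made the job slower, so this is a diagnostic switch rather than a tuning recommendation.

```python
import xml.etree.ElementTree as ET

# Property names as used by Hadoop 0.20.x; treat them as an assumption
# if you are running a different release.
SPECULATIVE_KEYS = {
    "map": "mapred.map.tasks.speculative.execution",
    "reduce": "mapred.reduce.tasks.speculative.execution",
}


def speculative_config(map_enabled, reduce_enabled):
    """Build a <configuration> XML string that sets both speculation flags.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("configuration")
    for phase, enabled in (("map", map_enabled), ("reduce", reduce_enabled)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = SPECULATIVE_KEYS[phase]
        ET.SubElement(prop, "value").text = "true" if enabled else "false"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Disable speculation for reducers only, since the killed tasks
    # reported in the thread were reduce tasks in the shuffle phase.
    print(speculative_config(map_enabled=True, reduce_enabled=False))
```

The generated fragment can be merged into a job's configuration file. With speculation off, any reduce tasks that still show up as "Killed" would point to a different cause.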
