Raghava: Are you able to share the last segment of the reducer log? You can get it from the web UI: http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
Adding more logging in your reducer task would help pinpoint where the
issue is. Also look in the jobtracker log.

Cheers

On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju
<m.vijayaragh...@gmail.com> wrote:

> Hi Ted,
>
> Thank you for the suggestion. I enabled it using the Configuration
> class because I cannot change the hadoop-site.xml file (I am not an
> admin). The situation is still the same --- it gets stuck at reduce
> 99% and does not move further.
>
> Regards,
> Raghava.
>
> On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > You need to turn it on yourself (hadoop-site.xml):
> >
> > <property>
> >   <name>mapred.reduce.tasks.speculative.execution</name>
> >   <value>true</value>
> > </property>
> >
> > <property>
> >   <name>mapred.map.tasks.speculative.execution</name>
> >   <value>true</value>
> > </property>
> >
> > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju
> > <m.vijayaragh...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thank you Eric, Prashant and Greg. Although the timeout problem
> > > was resolved, reduce is getting stuck at 99%. As of now, it has
> > > been stuck there for about 3 hrs. That is too long a wait for my
> > > task. Do you guys see any reason for this?
> > >
> > > Speculative execution is "on" by default, right? Or should I
> > > enable it?
> > >
> > > Regards,
> > > Raghava.
> > >
> > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence
> > > <gr...@yahoo-inc.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have also experienced this problem. Have you tried speculative
> > > > execution? Also, I have had jobs that took a long time for one
> > > > mapper / reducer because of a record that was significantly
> > > > larger than those contained in the other filesplits. Do you know
> > > > if it always slows down for the same filesplit?
> > > >
> > > > Regards,
> > > > Greg Lawrence
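For reference, a minimal sketch of what "enabling it using the
Configuration class" can look like: the same two properties from Ted's
hadoop-site.xml snippet, set programmatically in the job driver. This
assumes the 0.20 org.apache.hadoop.mapreduce API; the driver class name
and job name are illustrative, not from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same properties as the hadoop-site.xml snippet above, set
            // in code -- they are job-level settings, so no admin access
            // to the cluster configuration is needed:
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

            Job job = new Job(conf, "stuck-reduce-job"); // illustrative name
            // ... set mapper, reducer, and input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that speculative execution only helps when one attempt of a task is
slow while a duplicate attempt elsewhere can finish faster; it does not
speed up a reducer whose work is inherently long-running.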
> > > > On 4/8/10 10:30 AM, "Raghava Mutharaju"
> > > > <m.vijayaragh...@gmail.com> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > I got the time out error as mentioned below -- after 600
> > > > seconds, that attempt was killed and deemed a failure. I
> > > > searched around about this error, and one of the suggestions
> > > > was to include "progress" statements in the reducer -- it might
> > > > be taking longer than 600 seconds and so is timing out. I added
> > > > calls to context.progress() and context.setStatus(str) in the
> > > > reducer. Now, it works fine -- there are no timeout errors.
> > > >
> > > > But, for a few jobs, it takes an awfully long time to move from
> > > > "Map 100%, Reduce 99%" to Reduce 100%. For some jobs it is 15
> > > > mins and for some it was more than an hour. The reduce code is
> > > > not complex -- a 2-level loop and a couple of if-else blocks.
> > > > The input size is also not huge; the job that gets stuck for an
> > > > hour at reduce 99% takes in 130 files. Some of them are 1-3 MB
> > > > in size and a couple of them are 16 MB in size.
> > > >
> > > > Has anyone encountered this problem before? Any pointers? I use
> > > > Hadoop 0.20.2 on a Linux cluster of 16 nodes.
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Raghava.
> > > >
> > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju
> > > > <m.vijayaragh...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I am running a series of jobs one after another. While
> > > > executing the 4th job, the job fails. It fails in the reducer
> > > > --- the progress percentage would be map 100%, reduce 99%. It
> > > > gives out the following message:
> > > >
> > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > > > Task attempt_201003240138_0110_r_000018_1 failed to report
> > > > status for 602 seconds. Killing!
> > > >
> > > > It makes several attempts to execute it again but fails with a
> > > > similar message. I couldn't get anything from this error
> > > > message and wanted to look at the logs (located in the default
> > > > dir, ${HADOOP_HOME}/logs). But I don't find any files which
> > > > match the timestamp of the job. Also, I did not find history
> > > > and userlogs in the logs folder. Should I look at some other
> > > > place for the logs? What could be the possible causes for the
> > > > above error?
> > > >
> > > > I am using Hadoop 0.20.2 and I am running it on a cluster with
> > > > 16 nodes.
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Raghava.
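For reference, a minimal sketch of the fix described in the 10:30 AM
message above: periodic context.progress() and context.setStatus() calls
inside the reduce loop, so a long-running reduce() keeps reporting within
the 600-second task timeout (mapred.task.timeout). It assumes the 0.20
org.apache.hadoop.mapreduce API; the key/value types, class name, and the
1,000-record reporting interval are illustrative, not from the thread.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ProgressReducer extends Reducer<Text, Text, Text, Text> {
        private static final long REPORT_INTERVAL = 1000; // illustrative

        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            long seen = 0;
            for (Text value : values) {
                // ... the actual (potentially slow) per-value work ...
                if (++seen % REPORT_INTERVAL == 0) {
                    // Tell the framework this attempt is still alive;
                    // resets the mapred.task.timeout clock (600,000 ms
                    // by default).
                    context.progress();
                    context.setStatus("key " + key + ": " + seen
                            + " values processed");
                }
            }
        }
    }

Note that these calls only keep the attempt from being killed; if the
reduce work itself is slow (e.g. one unusually large key group, as Greg
suggests), the job will still sit at reduce 99% until that work finishes.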