Re: Reduce gets struck at 99%

Eric Arenas Thu, 08 Apr 2010 11:28:14 -0700

Yes Raghava,

I have experienced that issue before, and the solution that you mentioned also 
solved it for me (adding a call to context.progress() or context.setStatus() to 
tell the JobTracker that my tasks are still running).


regards
 Eric Arenas




________________________________
From: Raghava Mutharaju <m.vijayaragh...@gmail.com>
To: common-u...@hadoop.apache.org; mapreduce-user@hadoop.apache.org
Sent: Thu, April 8, 2010 10:30:49 AM
Subject: Reduce gets struck at 99%

Hello all,

         I got the timeout error mentioned below -- after 600 seconds, the 
attempt was killed and deemed a failure. I searched around for this error, and 
one of the suggestions was to include "progress" statements in the reducer -- 
it might be taking longer than 600 seconds and so is timing out. I added calls 
to context.progress() and context.setStatus(str) in the reducer. Now it works 
fine -- there are no timeout errors.
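
A minimal sketch of the pattern described above, for illustration only. It 
does not use the Hadoop classes, so it can run without a cluster: 
StatusReporter is a hypothetical stand-in for the two calls named in this 
message (context.progress() and context.setStatus(str)), and the class names, 
the pairwise-comparison loop, and the REPORT_EVERY interval are all assumptions, 
not the actual reducer from this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two context calls mentioned in this thread. */
interface StatusReporter {
    void progress();                 // "still alive" signal, no message
    void setStatus(String message);  // human-readable status string
}

/** Records calls so the sketch can be exercised without a cluster. */
class RecordingReporter implements StatusReporter {
    int progressCalls = 0;
    final List<String> statuses = new ArrayList<>();

    public void progress() { progressCalls++; }
    public void setStatus(String message) { statuses.add(message); }
}

class PairwiseReduceSketch {
    // Assumed interval: report once per this many inner-loop steps.
    static final int REPORT_EVERY = 1000;

    /** Two-level loop over the values; counts equal pairs and reports as it goes. */
    static long countMatchingPairs(List<String> values, StatusReporter reporter) {
        long matches = 0;
        long steps = 0;
        for (int i = 0; i < values.size(); i++) {
            for (int j = i + 1; j < values.size(); j++) {
                if (values.get(i).equals(values.get(j))) {
                    matches++;
                }
                // Signal liveness periodically, not on every step.
                if (++steps % REPORT_EVERY == 0) {
                    reporter.progress();
                }
            }
            // Update the status text once per outer iteration.
            reporter.setStatus("processed " + (i + 1) + " of " + values.size() + " values");
        }
        return matches;
    }
}
```

In a real reducer the same two calls would presumably be made on the context 
object passed to reduce(). Reporting every fixed number of steps, rather than 
on every step, keeps the reporting overhead small while still signalling well 
within the 600-second limit.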

         But for a few jobs, it takes an awfully long time to move from "Map 
100%, Reduce 99%" to Reduce 100%. For some jobs it's 15 minutes, and for some 
it was more than an hour. The reduce code is not complex -- a two-level loop 
and a couple of if-else blocks. The input size is also not huge; for the job 
that gets stuck for an hour at reduce 99%, it would take in 130. Some of them 
are 1-3 MB in size and a couple of them are 16 MB in size.

         Has anyone encountered this problem before? Any pointers? I use Hadoop 
0.20.2 on a Linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.


On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaragh...@gmail.com> 
wrote:

Hi all,
>
>       I am running a series of jobs one after another. The 4th job fails 
> during execution. It fails in the reducer -- the progress percentage 
> is map 100%, reduce 99%. It gives the following message:
>
>10/04/01 01:04:15 INFO mapred.JobClient: Task Id : 
>attempt_201003240138_0110_r_000018_1, Status : FAILED 
>Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 
>seconds. Killing!
>
>It makes several more attempts to execute it, but each fails with a similar 
>message. I couldn't get anything from this error message and wanted to look 
>at the logs (located in the default dir, ${HADOOP_HOME}/logs). But I can't 
>find any files that match the timestamp of the job. I also did not find 
>history and userlogs in the logs folder. Should I look somewhere else for 
>the logs? What could be the possible causes of the above error?
>
>       I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.
>
>Thank you.
>
>Regards,
>Raghava.
>
