[ https://issues.apache.org/jira/browse/HADOOP-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thibaut updated HADOOP-5367:
----------------------------

    Description: 
Hi,

After a while, my cluster runs the reduce tasks only sequentially (every
reducer on the same node), while the other nodes stay idle. The map phase,
however, still runs on all the nodes, even after such a "long" reduce phase
has completed, but the next reduce phase is then again executed sequentially.
This happens in my cluster after about 160 successfully completed jobs (some
jobs have the number of reducers set to 0!).
As a workaround, I have to restart the mapreduce service.

I didn't notice this behaviour in version 0.19.0. I can't use version 0.19.0
because of the multipleoutput bug when setting reducers to 0.

Another side note, which might be related: I also tried running the jobs with
speculative execution turned on. My cluster would always hold back one reducer
and only run it (in multiple instances) after the first of the other 6 reducers
had finished, instead of launching all of them at the same time.


Below is a short extract from the related logfile. It is full of these kinds of
entries.

09/02/28 12:48:07 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0051_r_000006_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000002_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0083_r_000006_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000005_1
09/02/28 12:48:10 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0105_r_000006_1
09/02/28 12:48:10 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0102_r_000006_1
09/02/28 12:48:12 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0051_r_000006_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000002_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0083_r_000006_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000005_1
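
Since the logfile is full of such entries, it may help to see which task attempts and jobs they keep referring to. The following is only a minimal sketch for that: the function name `summarize` and the variable `SAMPLE` are made up for illustration, the sample holds three of the entries quoted above, and the pattern assumes the message text and attempt id layout stay exactly as in this extract.

```python
import re
from collections import Counter

# Matches the JobTracker message from the extract and captures the full
# attempt id plus the job number inside it (layout assumed from the extract).
PATTERN = re.compile(
    r"cannot find taskid "
    r"(?P<attempt>attempt_\d+_(?P<job>\d+)_(?P<kind>[mr])_\d+_\d+)"
)


def summarize(log_text):
    """Count 'cannot find taskid' messages per attempt id and per job number."""
    # Collapse all whitespace so a message that was wrapped onto two
    # lines is still matched as one entry.
    flat = " ".join(log_text.split())
    attempts = Counter()
    jobs = Counter()
    for match in PATTERN.finditer(flat):
        attempts[match.group("attempt")] += 1
        jobs[match.group("job")] += 1
    return attempts, jobs


# Three entries copied from the extract, in their wrapped form.
SAMPLE = """
09/02/28 12:48:07 INFO mapred.JobTracker: Serious problem.  While updating
status, cannot find taskid attempt_200902271700_0051_r_000006_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating
status, cannot find taskid attempt_200902271700_0041_r_000002_1
09/02/28 12:48:12 INFO mapred.JobTracker: Serious problem.  While updating
status, cannot find taskid attempt_200902271700_0051_r_000006_1
"""

attempts, jobs = summarize(SAMPLE)
for attempt_id, count in attempts.most_common():
    print(count, attempt_id)
```

Run against the complete JobTracker log, the same counters would show whether the messages repeat for a small set of reduce attempts (as they do in the extract, where every id ends in `_1`) or are spread over many jobs.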


  was:
Hi,

After a while, my cluster runs the reduce tasks only sequentially (every
reducer on the same node), while the other nodes stay idle. The map phase,
however, still runs on all the nodes. This happens in my cluster after about
160 successfully completed jobs (some jobs have the number of reducers set to
0!).
As a workaround, I have to restart the mapreduce service.

I didn't notice this behaviour in version 0.19.0. I can't use version 0.19.0
because of the multipleoutput bug when setting reducers to 0.

Another side note, which might be related: I also tried running the jobs with
speculative execution turned on. My cluster would always hold back one reducer
and only run it (in multiple instances) after the first of the other 6 reducers
had finished, instead of launching all of them at the same time.


Below is a short extract from the related logfile. It is full of these kinds of
entries.

09/02/28 12:48:07 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0051_r_000006_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000002_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0083_r_000006_1
09/02/28 12:48:08 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000005_1
09/02/28 12:48:10 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0105_r_000006_1
09/02/28 12:48:10 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0102_r_000006_1
09/02/28 12:48:12 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0051_r_000006_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000002_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0083_r_000006_1
09/02/28 12:48:13 INFO mapred.JobTracker: Serious problem.  While updating status, cannot find taskid attempt_200902271700_0041_r_000005_1



> After some jobs have finished, Reducer will run new job's reduce tasks 
> sequentially and not in parallel (mapred.JobTracker: Serious problem.  While 
> updating status, cannot find taskid...)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5367
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5367
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.1
>         Environment: State: RUNNING
> Started: Fri Feb 27 17:00:07 CET 2009
> Version: 0.19.1, r745977
> Compiled: Fri Feb 20 00:16:34 UTC 2009 by ndaley
>            Reporter: Thibaut
>            Priority: Critical

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
