[ 
https://issues.apache.org/jira/browse/HADOOP-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651357#action_12651357
 ] 

Amar Kamat commented on HADOOP-4716:
------------------------------------

Upon restart, the JobTracker rebuilds the _task-completion-event_ list. This 
list includes events from the tracker that was lost across the restart. When a 
task-tracker (re)connects, it re-sizes its own _task-completion-event_ list. 
Hence the tracker retains the missing maps' events. After some time the 
JobTracker finds out that the tracker is lost, kills all the maps that were 
run on the lost tracker, and re-executes them. The tracker's 
_task-completion-event_ list will then look like 
{code}
1. SUC m1-t1
2. SUC m2-t2
3. SUC m3-t1
4. SUC m4-t2
5. KIL m1-t1
6. KIL m3-t1
7. SUC m1-t2
8. SUC m3-t2
{code}
The reducer takes _m1-t1_ and starts pulling map output from _t1_. Note that 
when the reducer fails on _m1_, it checks that _m1_ is _OBSOLETE_ and then 
ignores it. The test case times out because it takes a fair amount of time 
(~3 mins) to fail once. So this doesn't look like a bug but a limitation. The 
reason this issue is not commonly seen is that in practice the reducer starts 
late, so the tracker already has the latest updates, which prevents the reducer 
from picking up maps from the lost tracker. I could easily reproduce this 
problem when the reducer was scheduled early. 
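
To make the timing dependence concrete, here is a rough sketch that replays 
the event list above up to a given index and reports where a reducer would 
try to fetch each map from. The class, the _Event_ type and _fetchTargets_ 
are hypothetical names made up for this illustration. This is not Hadoop's 
TaskCompletionEvent API or the real reducer code, only a simplified model of 
the behaviour described in this comment. 

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; not Hadoop's actual classes.
class CompletionEventReplay {
    enum Status { SUC, KIL }

    static final class Event {
        final Status status;
        final String map;      // map id, e.g. "m1"
        final String tracker;  // tracker that ran the attempt, e.g. "t1"

        Event(Status status, String map, String tracker) {
            this.status = status;
            this.map = map;
            this.tracker = tracker;
        }
    }

    // The eight events listed in the comment, in the same order.
    static List<Event> exampleEvents() {
        List<Event> events = new ArrayList<>();
        events.add(new Event(Status.SUC, "m1", "t1"));
        events.add(new Event(Status.SUC, "m2", "t2"));
        events.add(new Event(Status.SUC, "m3", "t1"));
        events.add(new Event(Status.SUC, "m4", "t2"));
        events.add(new Event(Status.KIL, "m1", "t1"));
        events.add(new Event(Status.KIL, "m3", "t1"));
        events.add(new Event(Status.SUC, "m1", "t2"));
        events.add(new Event(Status.SUC, "m3", "t2"));
        return events;
    }

    // Replays the first `seen` events and returns map -> tracker, i.e. the
    // locations a reducer that has read only those events would fetch from.
    static Map<String, String> fetchTargets(List<Event> events, int seen) {
        Map<String, String> targets = new LinkedHashMap<>();
        for (Event e : events.subList(0, seen)) {
            if (e.status == Status.SUC) {
                targets.put(e.map, e.tracker);
            } else if (e.tracker.equals(targets.get(e.map))) {
                // A kill for the attempt we were pointing at makes it obsolete.
                targets.remove(e.map);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Event> events = exampleEvents();
        for (int seen : new int[] {4, 6, 8}) {
            System.out.println("after " + seen + " events: "
                + fetchTargets(events, seen));
        }
    }
}
```

In this model, a reducer that has read only the first four events still 
points at _t1_ for _m1_ and _m3_, which corresponds to the early-scheduled 
case. A reducer that has read all eight events points only at _t2_. 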
----
One thing that can be done here is to set _num-reducers=0_, as the test case 
doesn't actually require reducers. But it's better to keep reducers, as that 
makes the test case stricter and hence better. So if we decide to keep 
reducers, there should be some way to control the timeout (~3 min --> ~5 secs). 
Thoughts?

> testRestartWithLostTracker frequently times out
> -----------------------------------------------
>
>                 Key: HADOOP-4716
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4716
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Johan Oskarsson
>            Assignee: Amar Kamat
>            Priority: Minor
>             Fix For: 0.20.0
>
>         Attachments: log.txt
>
>
> This test frequently times out: 
> org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker.testRestartWithLostTracker
> Example: 
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3637/testReport/org.apache.hadoop.mapred/TestJobTrackerRestartWithLostTracker/testRestartWithLostTracker/

