[ 
https://issues.apache.org/jira/browse/HADOOP-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614307#action_12614307
 ] 

Amar Kamat commented on HADOOP-3780:
------------------------------------

The problem we are facing is as follows. The symbols used are:

||SYM||Stands for||Description||Used for||
|IC|Initial contact|Whether the TT is connected to the JT or not (TT's point of view)|Re-init/Sync the TT|
|SB|Seen before|Whether there are some previous status entries|Mark a TT as lost|
|HBE|Heartbeat entry|Whether the TT is connected/registered (JT's point of view)|Re-init/Sync the TT|
|JTR|JT restarted|Whether the JT has restarted|Re-init/Sync the TT|

Rules :

||IC||HBE||SB||JTR||Action||
|false|false|-|true|SYNC|
|false|false|-|false|Re-init|
|false|true|-|-|Re-send prev response|
|true|-|true|-|Mark lost (kill tasks)|
|false|-|false|-|Make SB false, i.e. clear previous status entries|
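
The rules table can be read as a small decision function. The following is only a sketch of the table, not the JobTracker code: the function name and the action strings are made up for illustration, and it assumes that the first three rows (which choose the response) and the last two rows (which handle the status entries) are evaluated independently, so one heartbeat can trigger one action from each group.

```python
def heartbeat_actions(ic, hbe, sb, jtr):
    """Return the actions the rules table prescribes, in table order.

    ic  -- initial contact (TT's point of view)
    hbe -- heartbeat entry exists (JT's point of view)
    sb  -- seen before (previous status entries exist)
    jtr -- JT has restarted
    """
    actions = []

    # Rows 1-3: choose the response for a non-initial contact.
    if not ic and not hbe:
        actions.append("SYNC" if jtr else "REINIT")
    elif not ic and hbe:
        actions.append("RESEND_PREV_RESPONSE")

    # Rows 4-5: handle the previous status entries.
    if ic and sb:
        actions.append("MARK_LOST")      # kills the tracker's tasks
    elif not ic and not sb:
        actions.append("CLEAR_STATUS")   # make SB false

    return actions
```

For example, {{heartbeat_actions(ic=True, hbe=False, sb=True, jtr=True)}} yields only {{MARK_LOST}}, which is the combination that causes trouble in the scenario described next.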


{noformat}

0) JT restarts and hence HBE for all TTs will be false.
1) TT connects to the restarted JT with IC=false
2) JT sends a SYNC
3) TT uploads the task statuses
4) JT (as a part of heartbeat) tries to update the task states/status
5) If (4) is successful : JT sets HBE=true for this TT
6) If (4) fails : the JT has made some changes in the task states but HBE=false.
     Consider task t being marked as SUCCEEDED before the SYNC fails.
7) TT comes back with IC = false
8) IC == false && HBE == false && JTR == true .... JT sends a SYNC again
9) TT responds back with IC = true and all updates
10) JT tries (4) again. Since IC == true and SB == true, JT considers this TT lost.
11) This causes the task t to be marked as KILLED
12) In the same method the status updates are applied and hence t will be marked as SUCCEEDED
13) Now we have task completion events with the same task marked as both KILLED and SUCCEEDED.
14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed while the reducers keep on ignoring the task t's output.
15) The job gets stuck
{noformat}

This problem will not occur if {{(4)}} always succeeds, i.e. if every 
{{SYNC}} results in HBE = true. {{(4)}} can only fail if the tracker is not 
resolved. Hence resolving the tracker inline solves the problem.
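
To make the sequence concrete, here is a toy replay of steps 0 to 13. It is a simplified model under stated assumptions, not the actual JobTracker logic: {{ToyJobTracker}}, {{heartbeat}}, {{replay}} and the {{resolved}} flag are hypothetical names, and the model assumes that a failed update in step (4) still records the task status and emits its completion event before failing.

```python
class ToyJobTracker:
    """Hypothetical, heavily simplified model of the restarted JT."""

    def __init__(self):
        self.hbe = False   # step 0: no heartbeat entry after the restart
        self.seen = {}     # previous status entries; SB is true when non-empty
        self.events = []   # task completion events, in order

    def heartbeat(self, ic, statuses, resolved):
        # Steps 10-11: IC == true and SB == true, so the TT is treated
        # as lost and its known tasks are marked KILLED.
        if ic and self.seen:
            for task in self.seen:
                self.events.append((task, "KILLED"))
            self.seen.clear()

        # Steps 4 and 12: apply the uploaded task statuses.
        for task, state in statuses.items():
            self.seen[task] = state
            self.events.append((task, state))

        # Step 6: with an unresolved tracker the update fails after the
        # task states were already changed, so HBE stays false.
        if not resolved:
            return False

        self.hbe = True    # step 5
        return True


def replay(resolved_on_first_sync):
    jt = ToyJobTracker()
    statuses = {"t": "SUCCEEDED"}
    # Steps 1-4: the TT answers the SYNC by uploading its statuses.
    jt.heartbeat(ic=True, statuses=statuses, resolved=resolved_on_first_sync)
    if not jt.hbe:
        # Steps 7-9: HBE is still false, so the JT sends SYNC again and
        # the TT uploads the same statuses a second time.
        jt.heartbeat(ic=True, statuses=statuses, resolved=True)
    return jt.events
```

In this model {{replay(False)}} ends with task t reported as SUCCEEDED, then KILLED, then SUCCEEDED again, while {{replay(True)}}, which corresponds to inline resolution, reports t as SUCCEEDED exactly once.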


> JobTracker should synchronously resolve the tasktracker's network location 
> when the tracker registers
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3780
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3780
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Amar Kamat
>
> This issue is inspired by HADOOP-3620. In JobTracker, the network address of 
> a tracker gets resolved asynchronously. It can now be done inline, i.e. while 
> the trackers register. This is of great help for HADOOP-3245, where this 
> enhancement makes the design simpler.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
