[ https://issues.apache.org/jira/browse/HADOOP-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614307#action_12614307 ]
Amar Kamat commented on HADOOP-3780:
------------------------------------
The problem we are facing is as follows (TT = TaskTracker, JT = JobTracker):
||SYM||Stands for||Description||Used for||
|IC|Initial contact|Whether the TT is connected to the JT or not, TT's point of view|Re-init/Sync the TT|
|SB|Seen before|Whether there are some previous status entries|Mark a TT as lost|
|HBE|Heartbeat entry|Whether the TT is connected/registered, JT's point of view|Re-init/Sync the TT|
|JTR|JT restarted|Whether the JT has restarted|Re-init/Sync the TT|
Rules :
||IC||HBE||SB||JTR||Action||
|false|false|-|true|SYNC|
|false|false|-|false|Re-init|
|false|true|-|-|Re-send prev response|
|true|-|true|-|Mark lost (kill tasks)|
|false|-|false|-|Make SB false, i.e. clear previous status entries|
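To make the rules table easier to check, here is a minimal Java sketch that encodes each row as a condition and returns every action whose row matches. The class, enum, and method names are hypothetical and are not taken from the JobTracker source; a "-" in the table is treated as "don't care". Note that the table defines no row for IC = true with SB = false, so the sketch returns an empty list in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not JobTracker code.
final class HeartbeatRules {
    enum Action { SYNC, REINIT, RESEND_PREV_RESPONSE, MARK_LOST, CLEAR_PREV_STATUS }

    /**
     * Returns, in table order, the actions of all rows that match.
     * ic  = initial contact (TT's point of view)
     * hbe = heartbeat entry (JT's point of view)
     * sb  = seen before (previous status entries exist)
     * jtr = JT restarted
     */
    static List<Action> matchingActions(boolean ic, boolean hbe, boolean sb, boolean jtr) {
        List<Action> actions = new ArrayList<>();
        if (!ic && !hbe && jtr)  actions.add(Action.SYNC);                 // row 1
        if (!ic && !hbe && !jtr) actions.add(Action.REINIT);               // row 2
        if (!ic && hbe)          actions.add(Action.RESEND_PREV_RESPONSE); // row 3
        if (ic && sb)            actions.add(Action.MARK_LOST);            // row 4
        if (!ic && !sb)          actions.add(Action.CLEAR_PREV_STATUS);    // row 5
        return actions;
    }
}
```

For example, {{matchingActions(true, false, true, true)}} matches only the "Mark lost" row, which is the combination the failure scenario ends up in.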
{noformat}
0) The JT restarts, so HBE is false for all TTs.
1) A TT connects to the restarted JT with IC = false.
2) The JT sends a SYNC.
3) The TT uploads the task statuses.
4) The JT (as part of the heartbeat) tries to update the task states/statuses.
5) If (4) succeeds: the JT sets HBE = true for this TT.
6) If (4) fails: the JT has already changed some task states, but HBE is still false.
Consider a task t that was marked SUCCEEDED before the SYNC failed.
7) The TT comes back with IC = false.
8) IC == false && HBE == false && JTR == true, so the JT sends a SYNC again.
9) The TT responds with IC = true and all updates.
10) The JT tries (4) again. Since IC == true and SB == true, the JT considers this TT
lost.
11) This causes task t to be marked KILLED.
12) In the same method the status updates are applied, so t is marked SUCCEEDED.
13) Now there are task completion events marking the same task as both KILLED and
SUCCEEDED.
14) Since task t is marked SUCCEEDED later, the JT assumes that the TIP is
complete, while the reducers keep ignoring task t's output.
15) The job gets stuck.
{noformat}
This problem will not occur if {{(4)}} always succeeds, i.e. if every
{{SYNC}} results in HBE = true. {{(4)}} can only fail if the tracker's network
location has not been resolved. Hence inline resolution solves the problem.
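The sequence can be replayed with a small toy model. This is only a sketch under stated assumptions, not JobTracker code: all names are hypothetical, only a single task t is tracked, the upload after a SYNC is assumed to carry IC = true (as in step 9), and the first failed attempt is assumed to leave status entries behind so that SB becomes true.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of steps 0-13; every name here is hypothetical.
final class SyncFailureDemo {
    boolean hbe = false;  // heartbeat entry: false for every TT after a JT restart
    boolean sb = false;   // seen before: previous status entries exist
    final List<String> history = new ArrayList<>();  // states recorded for task t

    /** Processes one status upload for task t; returns true if it completed. */
    boolean heartbeat(boolean ic, boolean trackerResolved) {
        if (ic && sb) {
            history.add("KILLED");   // steps 10-11: TT treated as lost, t is killed
        }
        history.add("SUCCEEDED");    // step 4 / step 12: status update applied
        sb = true;                   // assumption: status entries now exist
        if (!trackerResolved) {
            return false;            // step 6: state changed, HBE stays false
        }
        hbe = true;                  // step 5
        return true;
    }

    /** Replays the scenario and returns the states recorded for task t. */
    static List<String> run(boolean resolvedOnFirstSync) {
        SyncFailureDemo jt = new SyncFailureDemo();
        boolean ok = jt.heartbeat(true, resolvedOnFirstSync);  // steps 1-6
        if (!ok) {
            jt.heartbeat(true, true);                          // steps 7-12
        }
        return jt.history;
    }
}
```

In this model, {{run(false)}} records both KILLED and SUCCEEDED for t, while {{run(true)}}, which corresponds to the tracker being resolved inline before the first update, records only SUCCEEDED.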
> JobTracker should synchronously resolve the tasktracker's network location
> when the tracker registers
> -----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3780
> URL: https://issues.apache.org/jira/browse/HADOOP-3780
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Amar Kamat
>
> This issue is inspired by HADOOP-3620. In the JobTracker, the network address of
> a tracker gets resolved asynchronously. It can now be done inline, i.e. while the
> trackers register. This is of great help for HADOOP-3245, where this
> enhancement makes the design simpler.