[
https://issues.apache.org/jira/browse/MAPREDUCE-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tuyu updated MAPREDUCE-7264:
----------------------------
Target Version/s: (was: 3.2.1)
> overall reduction of ApplicationMaster exit because of unhandled
> TA_TOO_MANY_FETCH_FAILURE event
> ------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-7264
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7264
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 3.2.1
> Reporter: tuyu
> Priority: Critical
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> During a rolling restart of NodeManagers, some MapReduce jobs exit because of
> an unhandled TA_TOO_MANY_FETCH_FAILURE event.
> Details:
> If a task is in the SUCCEEDED state when it receives a TA_TOO_MANY_FETCH_FAILURE
> event, the AM handles the situation correctly. If the task is in
> SUCCESS_FINISHING_CONTAINER or some other state, the AM exits on an invalid event.
> [YARN-1469|https://issues.apache.org/jira/browse/YARN-1469]
> [MAPREDUCE-7240|https://issues.apache.org/jira/browse/MAPREDUCE-7240][MAPREDUCE-7249|https://issues.apache.org/jira/browse/MAPREDUCE-7249]
> [MAPREDUCE-5409|https://issues.apache.org/jira/browse/MAPREDUCE-5409]
> Reason:
> When a map task sends the done RPC to the AM, the AM transitions the task to
> the SUCCESS_FINISHING_CONTAINER state and adds it to the
> mapAttemptCompletionEvents list. When a reducer sends the
> getMapAttemptCompletionEvents RPC to fetch the completed maps, a task still in
> SUCCESS_FINISHING_CONTAINER is returned as well. If the NM restarts or
> stops at this point, many reducers fail to shuffle and report the failures to
> the AM, which then sends a TA_TOO_MANY_FETCH_FAILURE event to the map task.
> If the task's current state cannot handle that event, the AM exits.
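
A minimal sketch of the failure described above, assuming a table-driven state machine in which an unregistered (state, event) pair raises an error. The class FetchFailureSketch, its handle method, and the SUCCEEDED to FAILED target state are illustrative assumptions, not the actual TaskAttemptImpl code:

```java
import java.util.EnumMap;
import java.util.Map;

class FetchFailureSketch {
    // Subset of the task-attempt states named in the report.
    enum State { SUCCESS_FINISHING_CONTAINER, SUCCEEDED, FAILED }

    enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // Hypothetical transition table: current state -> (event -> next state).
    static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    static {
        for (State s : State.values()) {
            TABLE.put(s, new EnumMap<>(Event.class));
        }
        // The report says SUCCEEDED handles the event; the target state is assumed here.
        TABLE.get(State.SUCCEEDED).put(Event.TA_TOO_MANY_FETCH_FAILURE, State.FAILED);
        // Deliberately no entry for SUCCESS_FINISHING_CONTAINER: this is the gap.
    }

    // Looks up the transition; an unregistered pair is treated as an invalid event.
    static State handle(State current, Event event) {
        State next = TABLE.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        // Handled case: the attempt has already reached SUCCEEDED.
        System.out.println(handle(State.SUCCEEDED, Event.TA_TOO_MANY_FETCH_FAILURE));
        // Unhandled case: the attempt is still finishing its container.
        try {
            handle(State.SUCCESS_FINISHING_CONTAINER, Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException e) {
            System.out.println("AM would exit: " + e.getMessage());
        }
    }
}
```

In this model the fix amounts to adding a table entry for every state that can still be current when the reducers report their fetch failures.
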
> I found issues that address this problem, but they do not cover every situation.
> The states reached by transition from SUCCESS_FINISHING_CONTAINER can also
> receive a TA_TOO_MANY_FETCH_FAILURE event, for example
> (SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, FAILED, KILL_CONTAINER_CLEANUP).
> In Hadoop 3.2.1, only the SUCCEEDED, FAILED and KILLED states can handle the
> TA_TOO_MANY_FETCH_FAILURE event. Some JIRAs fix
> SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER and KILLED, but
> KILL_CONTAINER_CLEANUP and KILL_TASK_CLEANUP should also handle the
> TA_TOO_MANY_FETCH_FAILURE event.
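
The remaining gap can be worked out as a set difference. This is only a sketch: the class UnhandledStatesSketch and its field names are hypothetical, and the state lists are copied from the report rather than read from the Hadoop source:

```java
import java.util.EnumSet;

class UnhandledStatesSketch {
    // States named in the report as able to receive TA_TOO_MANY_FETCH_FAILURE.
    enum State {
        SUCCEEDED, FAILED, KILLED,
        SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER,
        KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP
    }

    // Per the report: states that handle the event in Hadoop 3.2.1.
    static final EnumSet<State> HANDLED_IN_3_2_1 =
        EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    // Per the report: states covered by the other JIRAs.
    static final EnumSet<State> COVERED_BY_OTHER_JIRAS =
        EnumSet.of(State.SUCCESS_CONTAINER_CLEANUP,
                   State.SUCCESS_FINISHING_CONTAINER,
                   State.KILLED);

    // Everything listed minus everything already handled or covered.
    static EnumSet<State> stillUnhandled() {
        EnumSet<State> remaining = EnumSet.allOf(State.class);
        remaining.removeAll(HANDLED_IN_3_2_1);
        remaining.removeAll(COVERED_BY_OTHER_JIRAS);
        return remaining;
    }

    public static void main(String[] args) {
        // prints [KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP]
        System.out.println(stillUnhandled());
    }
}
```
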
--
This message was sent by Atlassian Jira
(v8.3.4#803005)