[
https://issues.apache.org/jira/browse/MAPREDUCE-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
tuyu updated MAPREDUCE-7264:
----------------------------
Description:
During a rolling restart of NodeManagers, some MapReduce jobs exit because
of an unhandled TA_TOO_MANY_FETCH_FAILURE event.
details:
If a task attempt is in the SUCCEEDED state when it receives a
TA_TOO_MANY_FETCH_FAILURE event, the AM handles it correctly. If the attempt
is in SUCCESS_FINISHING_CONTAINER or certain other states, the AM exits with
an invalid event error.
[YARN-1469|https://issues.apache.org/jira/browse/YARN-1469]
[MAPREDUCE-7240|https://issues.apache.org/jira/browse/MAPREDUCE-7240][MAPREDUCE-7249|https://issues.apache.org/jira/browse/MAPREDUCE-7249]
[MAPREDUCE-5409|https://issues.apache.org/jira/browse/MAPREDUCE-5409]
reason:
When a map task sends the done RPC to the AM, the AM transitions the task
attempt to the SUCCESS_FINISHING_CONTAINER state and adds it to the
mapAttemptCompletionEvents list. When a reducer sends the
getMapAttemptCompletionEvents RPC to fetch the completed maps, attempts that
are still in SUCCESS_FINISHING_CONTAINER are returned as well. If the NM is
restarted or stopped at this point, many reducers fail to shuffle and report
the failures to the AM, which then sends a TA_TOO_MANY_FETCH_FAILURE event
to the map attempt. If the attempt's current state cannot handle
TA_TOO_MANY_FETCH_FAILURE, the AM exits.
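The failure mode described above can be illustrated with a minimal,
self-contained sketch. It is not Hadoop's TaskAttemptImpl code: the class
FetchFailureSketch, its register() method, the reduced set of states, and
the choice to move the attempt to FAILED after the event are all
assumptions made for illustration. It only shows how a per-state transition
table rejects an event that the current state has no transition for.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not Hadoop's actual state machine.
class FetchFailureSketch {
    enum State { SUCCESS_FINISHING_CONTAINER, SUCCEEDED, FAILED }
    enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // For each state, the events it has a registered transition for.
    private final Map<State, Set<Event>> handled = new EnumMap<>(State.class);
    private State current;

    FetchFailureSketch(State initial) {
        current = initial;
        for (State s : State.values()) {
            handled.put(s, EnumSet.noneOf(Event.class));
        }
        // Per the report, SUCCEEDED already handles the event.
        handled.get(State.SUCCEEDED).add(Event.TA_TOO_MANY_FETCH_FAILURE);
    }

    // Registers a transition for an event in a given state (the "fix").
    void register(State s, Event e) {
        handled.get(s).add(e);
    }

    State current() {
        return current;
    }

    // Rejects events the current state has no transition for.
    void handle(Event e) {
        if (!handled.get(current).contains(e)) {
            throw new IllegalStateException(
                "Invalid event: " + e + " at " + current);
        }
        // Assumption for this sketch: the attempt is marked FAILED.
        current = State.FAILED;
    }

    public static void main(String[] args) {
        FetchFailureSketch attempt =
            new FetchFailureSketch(State.SUCCESS_FINISHING_CONTAINER);
        try {
            attempt.handle(Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException ex) {
            System.out.println("before fix: " + ex.getMessage());
        }
        attempt.register(State.SUCCESS_FINISHING_CONTAINER,
            Event.TA_TOO_MANY_FETCH_FAILURE);
        attempt.handle(Event.TA_TOO_MANY_FETCH_FAILURE);
        System.out.println("after fix: " + attempt.current());
    }
}
```

In this sketch the exception is caught locally; in the situation reported
here, the unhandled event is what causes the AM to exit.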
I found existing issues that address this problem, but they do not cover
every situation.
SUCCESS_FINISHING_CONTAINER and the states it can transition to can all
receive a TA_TOO_MANY_FETCH_FAILURE event, for example
(SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, FAILED, KILL_CONTAINER_CLEANUP).
In Hadoop 3.2.1, only the SUCCEEDED, FAILED, and KILLED states handle the
TA_TOO_MANY_FETCH_FAILURE event. Some JIRAs fix
SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, and KILLED, but
KILL_CONTAINER_CLEANUP and KILL_TASK_CLEANUP should also handle the
TA_TOO_MANY_FETCH_FAILURE event.
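The remaining gap can be worked out as a set difference over the states
named in this description. The sketch below is only that bookkeeping, not
Hadoop code: the class UnhandledStatesSketch and its helper methods are
hypothetical, and the two sets are copied from the lists above rather than
read from any Hadoop source.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical bookkeeping sketch; the state names come from this report.
class UnhandledStatesSketch {
    enum State {
        SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER,
        FAILED, KILLED, KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP
    }

    // States this report says should handle TA_TOO_MANY_FETCH_FAILURE.
    static Set<State> namedInReport() {
        return EnumSet.allOf(State.class);
    }

    // States handled in 3.2.1 plus those fixed by the other JIRAs.
    static Set<State> covered() {
        Set<State> s = EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);
        s.addAll(EnumSet.of(State.SUCCESS_CONTAINER_CLEANUP,
            State.SUCCESS_FINISHING_CONTAINER, State.KILLED));
        return s;
    }

    // Returns the states that should handle the event but are not covered.
    static Set<State> unhandled(Set<State> shouldHandle, Set<State> handles) {
        Set<State> gap = EnumSet.noneOf(State.class);
        gap.addAll(shouldHandle);
        gap.removeAll(handles);
        return gap;
    }

    public static void main(String[] args) {
        System.out.println(unhandled(namedInReport(), covered()));
    }
}
```

Under these assumptions the difference is KILL_CONTAINER_CLEANUP and
KILL_TASK_CLEANUP, the two states this issue proposes to cover.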
> overall reduction of ApplicationMaster exit because of unhandled
> TA_TOO_MANY_FETCH_FAILURE event
> ------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-7264
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7264
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 3.2.1
> Reporter: tuyu
> Priority: Critical
> Fix For: 3.2.1