[jira] [Updated] (MAPREDUCE-7264) overall reduction of ApplicationMaster exit because of unhandled TA_TOO_MANY_FETCH_FAILURE event

tuyu (Jira) Wed, 19 Feb 2020 02:19:10 -0800


     [ https://issues.apache.org/jira/browse/MAPREDUCE-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


tuyu updated MAPREDUCE-7264:
----------------------------
    Description: 
During a rolling restart of NodeManagers, some MapReduce jobs exit because of 
an unhandled TA_TOO_MANY_FETCH_FAILURE event.

Details:
   If a task attempt is in the SUCCEEDED state when it receives a 
TA_TOO_MANY_FETCH_FAILURE event, the AM handles it correctly. If the attempt is 
in SUCCESS_FINISHING_CONTAINER or certain other states, the AM exits with an 
invalid-event error. Related issues: 
[YARN-1469|https://issues.apache.org/jira/browse/YARN-1469] 
[MAPREDUCE-7240|https://issues.apache.org/jira/browse/MAPREDUCE-7240][MAPREDUCE-7249|https://issues.apache.org/jira/browse/MAPREDUCE-7249]
 [MAPREDUCE-5409|https://issues.apache.org/jira/browse/MAPREDUCE-5409]
   Reason:
   When a map task sends the done RPC to the AM, the AM transitions the task 
attempt to the SUCCESS_FINISHING_CONTAINER state and adds it to the 
mapAttemptCompletionEvents list. When a reducer sends the 
getMapAttemptCompletionEvents RPC to fetch the completed maps, attempts that 
are still in the SUCCESS_FINISHING_CONTAINER state are included in the 
response. If the NM is restarted or stopped at this point, many reducers fail 
to shuffle and report the failures to the AM, which then sends a 
TA_TOO_MANY_FETCH_FAILURE event to the map attempt. If the attempt's current 
state cannot handle that event, the AM exits.
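
The failure mode described above can be illustrated with a small, self-contained sketch. This is a toy state machine, not Hadoop's TaskAttemptImpl or StateMachineFactory; the class FetchFailureStateSketch, its methods, and the choice of FAILED as the target state are assumptions made for illustration only. It shows that an event with no registered transition for the current state is rejected, and that registering a transition for that state makes the same event handled.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Toy state machine (hypothetical); it only mimics the invalid-event behaviour. */
public class FetchFailureStateSketch {
    /** Subset of the task attempt states named in this issue. */
    public enum State { SUCCESS_FINISHING_CONTAINER, SUCCEEDED, FAILED }

    public enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // For each state, the events that have a registered transition.
    private final Map<State, Set<Event>> handled = new EnumMap<>(State.class);
    private State state;

    public FetchFailureStateSketch(State initial) {
        state = initial;
        for (State s : State.values()) {
            handled.put(s, EnumSet.noneOf(Event.class));
        }
    }

    /** Registers a transition for the given event in the given state. */
    public void register(State from, Event event) {
        handled.get(from).add(event);
    }

    /** Rejects the event if the current state has no transition for it. */
    public void handle(Event event) {
        if (!handled.get(state).contains(event)) {
            throw new IllegalStateException(
                "Invalid event: " + event + " at " + state);
        }
        // Simplification: every handled fetch failure moves the attempt to FAILED.
        state = State.FAILED;
    }

    public State getState() {
        return state;
    }

    public static void main(String[] args) {
        // No transition registered: the event is rejected.
        FetchFailureStateSketch unhandled =
            new FetchFailureStateSketch(State.SUCCESS_FINISHING_CONTAINER);
        try {
            unhandled.handle(Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Transition registered: the same event is handled.
        FetchFailureStateSketch fixed =
            new FetchFailureStateSketch(State.SUCCESS_FINISHING_CONTAINER);
        fixed.register(State.SUCCESS_FINISHING_CONTAINER,
            Event.TA_TOO_MANY_FETCH_FAILURE);
        fixed.handle(Event.TA_TOO_MANY_FETCH_FAILURE);
        System.out.println("Handled, new state: " + fixed.getState());
    }
}
```

In this sketch the rejected event surfaces as an exception; the issue reports that the equivalent situation in the AM ends the job.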

I found issues that address this problem, but they do not cover every situation.

A task attempt that has reached SUCCESS_FINISHING_CONTAINER can receive a 
TA_TOO_MANY_FETCH_FAILURE event in that state or in any state it transitions 
to afterwards, such as 
(SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, FAILED, KILL_CONTAINER_CLEANUP).

In Hadoop 3.2.1, only the SUCCEEDED, FAILED, and KILLED states handle the 
TA_TOO_MANY_FETCH_FAILURE event. Other Jira issues add handling for 
SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, and KILLED, but 
KILL_CONTAINER_CLEANUP and KILL_TASK_CLEANUP should also handle the 
TA_TOO_MANY_FETCH_FAILURE event.
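
The remaining gap can be worked out as a set difference over the state lists given in this description. The sketch below is only bookkeeping for those lists; the class FetchFailureCoverage and its method uncovered() are hypothetical names, and the sets are copied from the text rather than read from Hadoop's source.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical helper: which listed states still lack a handler for the event. */
public class FetchFailureCoverage {
    public enum State {
        SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER,
        FAILED, KILLED, KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP
    }

    public static Set<State> uncovered() {
        // States the description says may see TA_TOO_MANY_FETCH_FAILURE,
        // including KILL_TASK_CLEANUP from its final paragraph.
        Set<State> mayReceive = EnumSet.of(
            State.SUCCEEDED, State.SUCCESS_CONTAINER_CLEANUP,
            State.SUCCESS_FINISHING_CONTAINER, State.FAILED,
            State.KILL_CONTAINER_CLEANUP, State.KILL_TASK_CLEANUP);

        // States reported as handling the event in Hadoop 3.2.1.
        Set<State> handledIn321 =
            EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

        // States reported as covered by the other Jira issues.
        Set<State> coveredByOtherIssues = EnumSet.of(
            State.SUCCESS_CONTAINER_CLEANUP,
            State.SUCCESS_FINISHING_CONTAINER, State.KILLED);

        Set<State> remaining = EnumSet.copyOf(mayReceive);
        remaining.removeAll(handledIn321);
        remaining.removeAll(coveredByOtherIssues);
        return remaining;
    }

    public static void main(String[] args) {
        // Prints [KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP]
        System.out.println(uncovered());
    }
}
```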


  was:
During a rolling restart of NodeManagers, some MapReduce jobs exit because of 
an unhandled TA_TOO_MANY_FETCH_FAILURE event.

Details:
   If a task attempt is in the SUCCEEDED state when it receives a 
TA_TOO_MANY_FETCH_FAILURE event, the AM handles it correctly. If the attempt is 
in SUCCESS_FINISHING_CONTAINER or certain other states, the AM exits with an 
invalid-event error. Related issues: 
[YARN-1469|https://issues.apache.org/jira/browse/YARN-1469] 
[MAPREDUCE-7240|https://issues.apache.org/jira/browse/MAPREDUCE-7240][MAPREDUCE-7249|https://issues.apache.org/jira/browse/MAPREDUCE-7249]
 [MAPREDUCE-5409|https://issues.apache.org/jira/browse/MAPREDUCE-5409]
   Reason:
   When a map task sends the done RPC to the AM, the AM transitions the task 
attempt to the SUCCESS_FINISHING_CONTAINER state and adds it to the 
mapAttemptCompletionEvents list. When a reducer sends the 
getMapAttemptCompletionEvents RPC to fetch the completed maps, attempts that 
are still in the SUCCESS_FINISHING_CONTAINER state are included in the 
response. If the NM is restarted or stopped at this point, many reducers fail 
to shuffle and report the failures to the AM, which then sends a 
TA_TOO_MANY_FETCH_FAILURE event to the map attempt. If the attempt's current 
state cannot handle that event, the AM exits.

I found issues that address this problem, but they do not cover every situation.

The state Transition from SUCCESS_FINISHING_CONTAINER will receive 



> overall reduction of ApplicationMaster exit because of unhandled 
> TA_TOO_MANY_FETCH_FAILURE event
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7264
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7264
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.1
>            Reporter: tuyu
>            Priority: Critical
>             Fix For: 3.2.1


