Nicolas Fraison created MAPREDUCE-6982:
------------------------------------------
Summary: Containers on lost nodes are considered failed after a
too long time.
Key: MAPREDUCE-6982
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6982
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 2.6.0
Environment: cdh5.5.0
Reporter: Nicolas Fraison
Priority: Minor
Containers on lost nodes (NodeManager unavailable or server unavailable) take
too long to be considered failed.
This is because the AppMaster tries to clean up the container on the
unavailable node.
The proposed patch limits the impact of this timeout by handling NodeManager
lost events in the AM as described below:
* on NodeManager service unavailability (crash, OOM ...):
When receiving a lost NodeManager event, the AM fails the impacted attempt and
does not go through the cleanup stage.
* on NodeManager server unavailability, with default settings the AM first
detects that the attempt has timed out and tries to clean up the attempt:
When receiving a lost NodeManager event, the AM stops the cleanup process on
the impacted container and fails the attempt.
This reduces the duration of the timeout to the time needed to detect that a
NodeManager is down.
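
The two cases above can be illustrated with a small, self-contained sketch.
It is not the code of the patch and does not use the real MR AM classes: the
class LostNodeSketch, the State enum, the Attempt type and the onNodeLost
method are all hypothetical names chosen for illustration. It only shows the
intended transitions: a running attempt on a lost node fails directly, and an
attempt already in cleanup has its cleanup abandoned and then fails.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behaviour described above; not the actual AM code.
class LostNodeSketch {
    enum State { RUNNING, CLEANING_UP, FAILED, SUCCEEDED }

    static final class Attempt {
        final String nodeId;            // node hosting the attempt's container
        State state;
        boolean cleanupCancelled = false;

        Attempt(String nodeId, State state) {
            this.nodeId = nodeId;
            this.state = state;
        }
    }

    // Apply a "node lost" event to every attempt hosted on that node.
    static void onNodeLost(String lostNodeId, List<Attempt> attempts) {
        for (Attempt a : attempts) {
            if (!a.nodeId.equals(lostNodeId)) {
                continue;               // attempts on other nodes are untouched
            }
            switch (a.state) {
                case RUNNING:
                    // Case 1: fail directly, skipping the cleanup stage.
                    a.state = State.FAILED;
                    break;
                case CLEANING_UP:
                    // Case 2: stop waiting on the unreachable node, then fail.
                    a.cleanupCancelled = true;
                    a.state = State.FAILED;
                    break;
                default:
                    break;              // already finished: nothing to do
            }
        }
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt("node-1", State.RUNNING));
        attempts.add(new Attempt("node-1", State.CLEANING_UP));
        attempts.add(new Attempt("node-2", State.RUNNING));

        onNodeLost("node-1", attempts);

        for (Attempt a : attempts) {
            System.out.println(a.nodeId + " " + a.state
                    + " cleanupCancelled=" + a.cleanupCancelled);
        }
    }
}
```

In this sketch the attempt on node-2 keeps running, since only attempts whose
container was on the lost node are affected.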
This issue is similar to
[MAPREDUCE-6659|https://issues.apache.org/jira/browse/MAPREDUCE-6659], to which
I can't attach the patch.