Samrat002 opened a new pull request, #8208:
URL: https://github.com/apache/hadoop/pull/8208
### Description of PR
When a Hadoop cluster runs on cloud spot instances and an AM is launched on
one of those instances, removing the instances leads to a large number of AM
launch failures. The failures show up as Token Expired or Container
Liveliness Expiry errors, because the AM launch threads are busy retrying
connections to AM hosts (spot instances) that are already down. Using separate
thread pools for cleanup and launch reduces these AM launch failures.
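
To illustrate the idea, here is a minimal, self-contained sketch of two independent pools. It is not the code in this patch: the class `SeparatePoolsSketch`, its methods, and the pool sizes are hypothetical names chosen for illustration, and a `CountDownLatch` stands in for a cleanup call that hangs on an unreachable host.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not the actual ApplicationMasterLauncher code. */
final class SeparatePoolsSketch {
  private final ExecutorService launchPool;   // handles AM launch work only
  private final ExecutorService cleanupPool;  // handles AM cleanup work only

  SeparatePoolsSketch(int launchThreads, int cleanupThreads) {
    this.launchPool = Executors.newFixedThreadPool(launchThreads);
    this.cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
  }

  Future<?> submitLaunch(Runnable task) {
    return launchPool.submit(task);
  }

  Future<?> submitCleanup(Runnable task) {
    return cleanupPool.submit(task);
  }

  void shutdown() {
    // Interrupts any cleanup task still waiting on a dead host.
    launchPool.shutdownNow();
    cleanupPool.shutdownNow();
  }

  /** Returns true if a launch finishes while every cleanup thread is stuck. */
  static boolean launchCompletesWhileCleanupBlocked() {
    SeparatePoolsSketch pools = new SeparatePoolsSketch(2, 2);
    // Never counted down: simulates a host that never answers.
    CountDownLatch hostReachable = new CountDownLatch(1);
    try {
      // Queue more stuck cleanups than there are cleanup threads.
      for (int i = 0; i < 4; i++) {
        pools.submitCleanup(() -> {
          try {
            hostReachable.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The launch runs on its own pool, so it is not queued behind cleanups.
      Future<?> launch = pools.submitLaunch(() -> { });
      launch.get(5, TimeUnit.SECONDS);
      return launch.isDone();
    } catch (Exception e) {
      return false;
    } finally {
      pools.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(launchCompletesWhileCleanupBlocked());
  }
}
```

With a single shared pool of two threads, the same four stuck cleanup tasks would occupy every thread and the launch would wait behind them until its token or liveliness timer ran out.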
### Token Expired
```
2022-07-19 14:56:33,486 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
(IPC Server handler 39 on 8041): Unauthorized request to start container.
This token is expired. current time is 1658242593486 found 1658242289457
Note: System times on machines may be out of sync. Check system time and
time zones.
```
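
As a quick check on the log above, the two timestamps can be subtracted to see how stale the token was when the start-container request arrived. This sketch assumes that "found" is the token's expiry timestamp in epoch milliseconds; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper for reading the "token is expired" log line. */
final class TokenExpiryGap {
  /** How long the token had been expired when the request was handled. */
  static Duration expiredFor(long currentTimeMs, long tokenExpiryMs) {
    return Duration.ofMillis(currentTimeMs - tokenExpiryMs);
  }

  public static void main(String[] args) {
    // Values copied from the log line above.
    Duration gap = expiredFor(1658242593486L, 1658242289457L);
    // prints "304029 ms (304 s)"
    System.out.println(gap.toMillis() + " ms (" + gap.getSeconds() + " s)");
  }
}
```

Under that assumption the request reached the NodeManager roughly five minutes after the token expired, which is consistent with the launch having waited behind busy launcher threads.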
### Container Liveliness Expiry
```
2022-07-19 16:06:48,663 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl
(ResourceManager Event Processor): container_xxxxxxxxxxxxx_xxxxxxx_xx_000001
Container Transitioned from ACQUIRED to EXPIRED
2022-07-19 16:10:08,663 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker):
Expired:<container=container_xxxxxxxxxxxxx_xxxxxxx_xx_000001, increase=false>
Timed out after 600 secs
```
Associated ticket:
[YARN-11251](https://issues.apache.org/jira/browse/YARN-11251)
### How was this patch tested?
This patch was tested on an EMR cluster with 1 master node, 1 core node, and
2 task nodes, where the task nodes are spot instances. We launched an AM on
one of the task nodes and then brought that node down, which reproduces the
scenario described above.
TODO: unit tests need to be added.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [x] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]