[
https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weiwei Yang updated AMBARI-18929:
---------------------------------
Attachment: AMBARI-18929_trunk.patch
Hi [~Tim Thorpe], [~dili]
I have attached a patch to fix this. With this patch, the YARN service check first
queries the REST API {{http://<rm_host>:<port>/ws/v1/cluster/info}} to find the
active RM address. This API has been available since Hadoop 2.3, the very first
version to support HA. It is served directly, with no redirection, by both the
active and standby RMs in an HA environment and by the single RM in a non-HA
environment. Once the active RM is identified, the rest of the logic remains the
same. Otherwise the service check fails, either because the HTTP service cannot
be reached on either RM or because both RMs are in standby state.
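To illustrate the lookup described above, here is a minimal sketch in Python. It is not the code from the attached patch: the function names {{fetch_cluster_info}} and {{find_active_rm}} are hypothetical, and it assumes the response JSON carries the RM state in {{clusterInfo.haState}} with the value {{ACTIVE}} for the active RM.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(address, timeout=5):
    """GET /ws/v1/cluster/info from one RM and return the parsed JSON body.

    `address` is "<rm_host>:<port>" (hypothetical helper, plain HTTP only).
    """
    url = "http://%s/ws/v1/cluster/info" % address
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_active_rm(rm_addresses, fetch=fetch_cluster_info):
    """Return the first address whose RM reports haState ACTIVE, else None."""
    for address in rm_addresses:
        try:
            info = fetch(address)
        except (OSError, ValueError):
            # RM is down, unreachable, or returned a non-JSON body: try the next one.
            continue
        if not isinstance(info, dict):
            continue
        # Assumption: the state is reported under clusterInfo.haState.
        if (info.get("clusterInfo") or {}).get("haState") == "ACTIVE":
            return address
    # Every RM was unreachable or in standby: the service check should fail.
    return None
```

A caller would pass the configured RM web addresses, for example {{find_active_rm(["rm1.example.com:8088", "rm2.example.com:8088"])}} (hostnames made up for illustration), and fail the check when the result is {{None}}.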
I tested this patch in the following scenarios:
HA environment
# Both active & standby RMs are up : SUCCESS
# Shut down standby RM, active remains up : SUCCESS
# Shut down active RM, the other RM transitions to active : SUCCESS
# Shut down ZooKeeper, both RMs are standby : FAIL
# Both RMs are down : FAIL
Non-HA environment
# RM is up : SUCCESS
# RM is down : FAIL
Please help review the patch.
> Yarn service check fails when either resource manager is down in HA enabled
> cluster
> -----------------------------------------------------------------------------------
>
> Key: AMBARI-18929
> URL: https://issues.apache.org/jira/browse/AMBARI-18929
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.4.0
> Reporter: Weiwei Yang
> Attachments: AMBARI-18929_trunk.patch
>
>
> When HA is enabled, yarn service_check.py fails if one of the RMs is down, even
> if the other one is active. This gives the user the wrong impression that the
> YARN cluster is not healthy. Instead, the service check should pass, or at least
> pass with a warning that lets the user know one RM is down.