[ 
https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated AMBARI-18929:
---------------------------------
    Attachment: AMBARI-18929_trunk.patch

Hi [~Tim Thorpe], [~dili]

I attached a patch to fix this. With this patch, the YARN service check first 
queries the REST API {{http://<rm_host>:<port>/ws/v1/cluster/info}} to find the 
active RM address. This API has been available since Hadoop 2.3, the very first 
version to support HA, and it is served without redirection by both the active 
and standby RMs, as well as by the single RM in a non-HA environment. Once the 
active RM is identified, the rest of the logic remains the same. Otherwise the 
service check fails, either because the HTTP service cannot be reached on 
either RM or because both RMs are in standby state.
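
For illustration only, the lookup described above could be sketched roughly as 
follows. This is not the code in the attached patch. The function names, the 
timeout value, and the injectable {{fetch}} parameter are hypothetical, and the 
sketch assumes that the response carries a {{clusterInfo}} object whose 
{{haState}} field reads {{ACTIVE}} on the active RM, including the single RM 
of a non-HA setup.

```python
import json
from http.client import HTTPException
from urllib.request import urlopen


def fetch_cluster_info(rm_address, timeout=5):
    """Return the clusterInfo dict from one RM, or None if it cannot be read.

    rm_address is "<rm_host>:<port>" (hypothetical helper, not from the patch).
    """
    url = "http://%s/ws/v1/cluster/info" % rm_address
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, HTTPException):
        # Connection refused, timeout, HTTP error status, or invalid JSON.
        return None
    return body.get("clusterInfo") if isinstance(body, dict) else None


def find_active_rm(rm_addresses, fetch=fetch_cluster_info):
    """Return the first address reporting haState ACTIVE, else None."""
    for address in rm_addresses:
        info = fetch(address)
        if info and info.get("haState") == "ACTIVE":
            return address
    # Every RM is unreachable or in standby: the service check should fail.
    return None
```

A caller would pass every configured RM web address, for example 
{{find_active_rm(["rm1.example.com:8088", "rm2.example.com:8088"])}}, and treat 
a {{None}} result as a failed service check.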

I tested this patch in the following scenarios:

HA environment
# Both active & standby RMs are up : SUCCESS
# Shut down standby RM, active remains up : SUCCESS
# Shut down active RM, the other RM becomes active : SUCCESS
# Shut down ZooKeeper, both RMs are standby : FAIL
# Both RMs are down : FAIL

Non-HA environment
# RM is up : SUCCESS
# RM is down : FAIL

Please help review the patch.

> Yarn service check fails when either resource manager is down in HA enabled 
> cluster
> -----------------------------------------------------------------------------------
>
>                 Key: AMBARI-18929
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18929
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>         Attachments: AMBARI-18929_trunk.patch
>
>
> When HA is enabled, yarn service_check.py fails if one of the RMs is down, 
> even if the other one is active. This gives the user the wrong impression 
> that the YARN cluster is not healthy. Instead, the service check should pass, 
> or at least pass with a warning that lets the user know one RM is down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
