[ https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690127#comment-15690127 ]

Di Li commented on AMBARI-18929:
--------------------------------

[~cheersyang] In principle, the two service checks seem to fail for the same reason: both fail prematurely when the logic hits a dead node. Implementation-wise they are different.

Yarn has the following logic (also with a very short timeout, in my opinion). As you can see, it assumes both RMs are online. It should also check the curl exit code for better error handling.

    for rm_webapp_address in params.rm_webapp_addresses_list:
      info_app_url = params.scheme + "://" + rm_webapp_address + "/ws/v1/cluster/apps/" + application_name

      get_app_info_cmd = "curl --negotiate -u : -ks --location-trusted --connect-timeout " + CURL_CONNECTION_TIMEOUT + " " + info_app_url

      return_code, stdout, _ = get_user_call_output(get_app_info_cmd,
                                            user=params.smokeuser,
                                            path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                                            )

      # Handle HDP<2.2.8.1 where RM doesn't do automatic redirection from standby to active
      if stdout.startswith("This is standby RM. Redirecting to the current active RM:"):
        Logger.info(format("Skipped checking of {rm_webapp_address} since returned '{stdout}'"))
        continue

For HDFS, it's a two-path approach. I haven't run it, but I suspect it is the second part, the checkWebUI.py logic, that fails? If so, the suggestion is the same: better error handling, so that the check continues until all hosts have been pinged.

> Yarn service check fails when either resource manager is down in HA enabled
> cluster
> -----------------------------------------------------------------------------------
>
>                 Key: AMBARI-18929
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18929
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>
> When HA is enabled, yarn service_check.py fails if one of RM is down, even
> the other one is active. This gives user the wrong impression the yarn
> cluster is not healthy. Instead, service check should pass, or at least pass
> with warning that lets user know there is one RM down.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
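
The error handling suggested in the comment above (check the curl exit code, skip unreachable or standby RMs, and fail only when no RM answers) could look roughly like the sketch below. It is an illustration, not the Ambari implementation: it is written as standalone Python 3, and the names `curl_fetch`, `get_app_info`, and `STANDBY_PREFIX`, the injectable `fetch` parameter, and the default 5-second timeout are all assumptions made for this example. Only the curl flags, the REST path, and the standby message are taken from the quoted snippet.

```python
import json
import subprocess

# Message returned by a standby RM (taken from the quoted snippet).
STANDBY_PREFIX = "This is standby RM. Redirecting to the current active RM:"


def curl_fetch(url, connect_timeout=5):
    """Hypothetical helper: run curl and return (exit_code, stdout)."""
    cmd = ["curl", "--negotiate", "-u", ":", "-ks", "--location-trusted",
           "--connect-timeout", str(connect_timeout), url]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stdout


def get_app_info(rm_addresses, application_name, scheme="http", fetch=curl_fetch):
    """Query each RM in turn and return (app_info, skipped).

    An RM that is unreachable, in standby, or returns unparsable output is
    recorded in `skipped` instead of failing the whole check. The check
    fails only if no RM returns usable application info.
    """
    skipped = []
    for address in rm_addresses:
        url = scheme + "://" + address + "/ws/v1/cluster/apps/" + application_name
        return_code, stdout = fetch(url)

        # A non-zero curl exit code means this RM could not be queried.
        if return_code != 0:
            skipped.append("%s: curl exited with code %d" % (address, return_code))
            continue

        # A standby RM that does not redirect automatically is not an error.
        if stdout.startswith(STANDBY_PREFIX):
            skipped.append("%s: standby RM" % address)
            continue

        try:
            return json.loads(stdout), skipped
        except ValueError:
            skipped.append("%s: response is not valid JSON" % address)

    raise RuntimeError("No ResourceManager returned application info: "
                       + "; ".join(skipped))
```

With this shape, a caller can pass the check when one RM answers and log the `skipped` entries as warnings, which matches the "pass with warning" behaviour requested in the issue description.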