[ https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690127#comment-15690127 ]
Di Li commented on AMBARI-18929:
--------------------------------
[~cheersyang] In principle, the two service checks seem to fail for the same
reason: both fail prematurely when the logic hits a dead node.
Implementation-wise they are different. YARN has the following logic (also
with a very short timeout, in my opinion). As you can see, it assumes both RMs
are online; it should also check the curl exit code for better error handling.

    for rm_webapp_address in params.rm_webapp_addresses_list:
      info_app_url = params.scheme + "://" + rm_webapp_address + "/ws/v1/cluster/apps/" + application_name

      get_app_info_cmd = "curl --negotiate -u : -ks --location-trusted --connect-timeout " + CURL_CONNECTION_TIMEOUT + " " + info_app_url

      return_code, stdout, _ = get_user_call_output(get_app_info_cmd,
                                                    user=params.smokeuser,
                                                    path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                                                    )

      # Handle HDP<2.2.8.1 where RM doesn't do automatic redirection from standby to active
      if stdout.startswith("This is standby RM. Redirecting to the current active RM:"):
        Logger.info(format("Skipped checking of {rm_webapp_address} since returned '{stdout}'"))
        continue

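A minimal sketch of the exit-code handling suggested above, not the actual Ambari change. The names `classify_curl_result`, `check_rms`, `run_cmd` and `build_cmd` are hypothetical; `run_cmd` stands in for `get_user_call_output` and is assumed to return `(return_code, stdout, stderr)`. The curl exit codes used are 6 (could not resolve host), 7 (failed to connect) and 28 (operation timed out).

```python
# Sketch only: the names below are hypothetical, not Ambari APIs.
# curl exit codes that mean "this host could not be reached":
# 6 = could not resolve host, 7 = failed to connect, 28 = operation timed out.
CURL_UNREACHABLE_CODES = {6, 7, 28}

STANDBY_PREFIX = "This is standby RM. Redirecting to the current active RM:"


def classify_curl_result(return_code, stdout):
    """Return 'unreachable', 'error', 'standby' or 'ok' for one curl call."""
    if return_code in CURL_UNREACHABLE_CODES:
        return "unreachable"
    if return_code != 0:
        return "error"
    if stdout.startswith(STANDBY_PREFIX):
        return "standby"
    return "ok"


def check_rms(rm_addresses, run_cmd, build_cmd):
    """Query every RM and return (responding, skipped).

    A dead or standby RM is recorded in `skipped` and the loop moves on,
    instead of failing the whole check on the first bad node.
    """
    responding, skipped = [], []
    for address in rm_addresses:
        return_code, stdout, _ = run_cmd(build_cmd(address))
        status = classify_curl_result(return_code, stdout)
        if status == "ok":
            responding.append((address, stdout))
        else:
            skipped.append((address, status))
    return responding, skipped
```

The caller would then validate the application state only from the entries in `responding`, and could log the entries in `skipped` as a warning.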
For HDFS, it's a two-path approach. I haven't run it, but I suspect it is the
second part, the checkWebUI.py logic, that fails. If so, the suggestion is the
same: better error handling, so that the check continues until all hosts have
been pinged.
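The "continue until all hosts are pinged" idea can be sketched as follows. This is an illustration, not the checkWebUI.py code: `probe_all`, `summarize` and the `probe` callable are hypothetical names, and the policy that one reachable host is enough to pass (with a warning) is an assumption taken from the issue description; a web UI check may instead want every host to respond.

```python
def probe_all(hosts, probe):
    """Call probe(host) for every host; never stop at the first failure."""
    results = {}
    for host in hosts:
        try:
            results[host] = bool(probe(host))
        except Exception:
            # A dead node must not abort the loop; record it and move on.
            results[host] = False
    return results


def summarize(results):
    """Return (passed, message). Assumed policy: pass if any host responded."""
    down = sorted(host for host, ok in results.items() if not ok)
    if not results or len(down) == len(results):
        return False, "no host responded: " + ", ".join(down)
    if down:
        return True, "WARNING: unreachable hosts: " + ", ".join(down)
    return True, "all hosts responded"
```

Separating the probing from the pass/fail decision keeps the policy in one place, so it can be tightened or relaxed without touching the loop.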
> Yarn service check fails when either resource manager is down in HA enabled
> cluster
> -----------------------------------------------------------------------------------
>
> Key: AMBARI-18929
> URL: https://issues.apache.org/jira/browse/AMBARI-18929
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.4.0
> Reporter: Weiwei Yang
>
> When HA is enabled, the YARN service_check.py fails if one of the RMs is
> down, even if the other one is active. This gives the user the wrong
> impression that the YARN cluster is not healthy. Instead, the service check
> should pass, or at least pass with a warning that lets the user know that one
> RM is down.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)