[jira] [Commented] (AMBARI-18929) Yarn service check fails when either resource manager is down in HA enabled cluster

Di Li (JIRA) Wed, 23 Nov 2016 05:42:14 -0800

    [ 
https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690127#comment-15690127
 ]


Di Li commented on AMBARI-18929:
--------------------------------

[~cheersyang] In principle, the two service checks seem to fail for the same 
reason: both fail prematurely when the logic hits a dead node.

Implementation-wise they are different. Yarn has the following logic (also 
with a very short timeout, in my opinion). As you can see, it assumes both RMs 
are online. It should also check the curl exit code for better error handling.

for rm_webapp_address in params.rm_webapp_addresses_list:
      info_app_url = params.scheme + "://" + rm_webapp_address + "/ws/v1/cluster/apps/" + application_name

      get_app_info_cmd = "curl --negotiate -u : -ks --location-trusted --connect-timeout " + CURL_CONNECTION_TIMEOUT + " " + info_app_url

      return_code, stdout, _ = get_user_call_output(get_app_info_cmd,
                                            user=params.smokeuser,
                                            path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                                            )

      # Handle HDP<2.2.8.1 where RM doesn't do automatic redirection from standby to active
      if stdout.startswith("This is standby RM. Redirecting to the current active RM:"):
        Logger.info(format("Skipped checking of {rm_webapp_address} since returned '{stdout}'"))
        continue
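
To make the suggestion concrete, here is a rough sketch of exit-code handling 
that keeps going past a dead RM. It is not the Ambari code: run_curl, 
check_resource_managers, the fetch parameter and the timeout value are all 
hypothetical names and values made up for illustration, and the real check 
would still go through get_user_call_output and params.

```python
import subprocess

CURL_CONNECTION_TIMEOUT = "5"  # seconds; assumed value, not taken from Ambari
STANDBY_PREFIX = "This is standby RM. Redirecting to the current active RM:"


def run_curl(url):
    """Hypothetical helper: run curl and return (exit_code, stdout)."""
    cmd = ["curl", "--negotiate", "-u", ":", "-ks", "--location-trusted",
           "--connect-timeout", CURL_CONNECTION_TIMEOUT, url]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, universal_newlines=True)
    return proc.returncode, proc.stdout


def check_resource_managers(addresses, scheme, application_name, fetch=run_curl):
    """Query every RM and fail only if none of them answers."""
    responses = {}  # address -> body returned by an RM that answered
    skipped = {}    # address -> reason the RM was skipped
    for address in addresses:
        url = scheme + "://" + address + "/ws/v1/cluster/apps/" + application_name
        return_code, stdout = fetch(url)
        if return_code != 0:
            # A non-zero curl exit code (e.g. 7, could not connect) means
            # this RM is unreachable: record it and try the next one.
            skipped[address] = "curl exited with code %d" % return_code
            continue
        if stdout.startswith(STANDBY_PREFIX):
            skipped[address] = "standby RM"
            continue
        responses[address] = stdout
    if not responses:
        raise RuntimeError("No ResourceManager answered: %s" % skipped)
    return responses, skipped
```

The skipped dict is what a "pass with warning" result could be built from: the 
check passes as long as one RM answered, and the caller can log every RM that 
did not.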

For HDFS, it's a two-path approach. I haven't run it, but I suspect it is the 
second part, the checkWebUI.py logic, that fails. If so, the suggestion is the 
same: better error handling, so that the check continues until all hosts have 
been pinged.

> Yarn service check fails when either resource manager is down in HA enabled 
> cluster
> -----------------------------------------------------------------------------------
>
>                 Key: AMBARI-18929
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18929
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>
> When HA is enabled, yarn service_check.py fails if one of the RMs is down, even 
> if the other one is active. This gives the user the wrong impression that the 
> yarn cluster is not healthy. Instead, the service check should pass, or at 
> least pass with a warning that lets the user know one RM is down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
