[ 
https://issues.apache.org/jira/browse/AMBARI-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yusaku Sako updated AMBARI-9894:
--------------------------------
    Assignee: Jonathan Hurley

> Alerts: YARN RM HA Alerts Are UNKNOWN Due to HA Redirects
> ---------------------------------------------------------
>
>                 Key: AMBARI-9894
>                 URL: https://issues.apache.org/jira/browse/AMBARI-9894
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: AMBARI-9894.patch
>
>
> 3-node cluster
> Configured ResourceManager HA. Three alerts are now Unknown:
> - ResourceManager RPC Latency. Has two instances as expected but each is 
> unknown "No JSON object could be decoded".
> - NodeManager Health Summary. Has two instances as expected but each is 
> unknown "No JSON object could be decoded".
> - ResourceManager CPU Utilization. Has two instances as expected but each is 
> unknown "No JSON object could be decoded".
> Both RMs are running and I can use the quick links to reach the RM UI and JMX.
> This fails because YARN forwards requests for the standby RM to the active 
> one. In this scenario, the alert gets back an HTTP 200 response that looks 
> like:
> {noformat}
> This is standby RM. Redirecting to the current active RM: 
> http://c6403.ambari.apache.org:8088/
> {noformat}
> Unfortunately, this is a refresh header redirect, which the metric alert 
> cannot handle. The alerts work because the original RM became active again 
> after the VMs restarted. 
> There are a few issues here:
> - YARN doesn't do HA in the same way that other services like HDFS do. As a 
> result, there's no config property that could let the alert know what to do 
> or which hosts to contact.
> - YARN actually forwards after an HTTP 200 to the active node, which doesn't 
> jibe with how alerts work.
> This is a definite problem and requires some further investigation.
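
For illustration only, here is a minimal Python sketch of how a metric check 
might recognize the standby response instead of failing with "No JSON object 
could be decoded". It is not the attached patch and not Ambari's alert code. 
The function names are hypothetical, and the Refresh header format 
("<seconds>; url=<target>") is an assumption; only the body text is taken from 
the response quoted above.

```python
import json
import re

# Body text the standby RM returns, as quoted in the issue description.
STANDBY_PATTERN = re.compile(
    r"This is standby RM\. Redirecting to the current active RM:\s*(\S+)")


def parse_refresh_header(value):
    """Return the URL from a Refresh header value, or None.

    Assumes the common "<seconds>; url=<target>" layout.
    """
    if not value:
        return None
    match = re.search(r"url\s*=\s*(\S+)", value, re.IGNORECASE)
    return match.group(1).strip("'\"") if match else None


def find_active_rm(headers, body):
    """Return the redirect target if this looks like a standby RM reply."""
    # Header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "refresh":
            target = parse_refresh_header(value)
            if target:
                return target
    # Fall back to the URL embedded in the plain-text body.
    match = STANDBY_PATTERN.search(body)
    return match.group(1) if match else None


def decode_metrics(headers, body):
    """Return (metrics, redirect_target) for an HTTP 200 response.

    metrics is the decoded JSON, or None when the body is not JSON.
    redirect_target is set only when the body is not JSON and the
    response looks like a standby RM redirect.
    """
    try:
        return json.loads(body), None
    except ValueError:
        return None, find_active_rm(headers, body)
```

A caller could re-issue the request against redirect_target when it is set, 
and report UNKNOWN only when both values are None. Whether the alert should 
follow the redirect or skip the standby host is a design decision this sketch 
does not settle.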



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
