[ 
https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaimin D Jetly updated AMBARI-4530:
-----------------------------------

    Description: 
On a two-host cluster, one of the agents was down.

The first INSTALL attempt fails because the tasks for the down agent time out 
and are aborted.

When INSTALL is retried, no tasks are created for one host (its agent is down, 
so the host is in HEARTBEAT_LOST state).
{noformat}
06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - 
Command is not created for servicecomponenthost , clusterName=c1, clusterId=2, 
serviceName=HBASE, componentName=HBASE_MASTER, 
hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, 
targetNewState=INSTALLED
{noformat}

However, some tasks are created for the other agent, and those succeed. At this 
point, the FE (front end) assumes that the install succeeded and issues a START 
all. That results in the state change errors we see in the log.
_The FE assumption is based on the fact that all created tasks succeeded._

{noformat}
06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught 
AmbariException when modifying a resource
org.apache.ambari.server.AmbariException: Invalid transition for 
servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, 
componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org, 
currentState=INSTALL_FAILED, newDesiredState=STARTED
{noformat}
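
The gap in the FE assumption can be sketched as follows. This is an illustration only, not Ambari code: the `Task` record, both function names, and the second hostname are hypothetical. It shows that "all created tasks succeeded" still holds when a host received no tasks at all, so a reliable check also needs the list of hosts that were expected to get tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for one install task reported by the server."""
    host: str
    status: str  # e.g. "COMPLETED", "FAILED", "ABORTED"


def all_created_tasks_succeeded(tasks):
    """The assumption described above: only tasks that exist are inspected."""
    return all(t.status == "COMPLETED" for t in tasks)


def hosts_without_tasks(expected_hosts, tasks):
    """Hosts that should have received tasks but got none, sorted by name."""
    hosts_with_tasks = {t.host for t in tasks}
    return sorted(set(expected_hosts) - hosts_with_tasks)


# Retry scenario from this report: c6401 is in HEARTBEAT_LOST, so only the
# other host gets tasks. The second hostname is made up for the example.
expected = ["c6401.ambari.apache.org", "c6402.ambari.apache.org"]
retry_tasks = [
    Task("c6402.ambari.apache.org", "COMPLETED"),
    Task("c6402.ambari.apache.org", "COMPLETED"),
]

print(all_created_tasks_succeeded(retry_tasks))
print(hosts_without_tasks(expected, retry_tasks))
```

The first check passes even though one host was skipped entirely, while the second one reports `c6401.ambari.apache.org` as having no tasks.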

We should discuss possible solutions. One option is to have the FE not issue a 
START if any master components are in INSTALL_FAILED state. In addition, 
showing that some hosts are in HEARTBEAT_LOST state would help the user debug 
the situation. Another option is to have the BE (back end) indicate that tasks 
were not created for some host(s). In any case, when a host is down, we need a 
way to get out of the install wizard.
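
A guard of the kind proposed here could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the attached patch: the `HostComponent` record, the `MASTER_COMPONENTS` set and `start_blockers` are hypothetical names, and `"HEALTHY"` is only a placeholder for a host that is heartbeating. The other state strings are the ones that appear in the log excerpts.

```python
from dataclasses import dataclass

# Hypothetical set of master components; the real list depends on the stack.
MASTER_COMPONENTS = {"HBASE_MASTER", "ZOOKEEPER_SERVER"}


@dataclass(frozen=True)
class HostComponent:
    """Hypothetical view of one component on one host."""
    component: str
    host: str
    state: str       # component state, e.g. "INSTALLED", "INSTALL_FAILED"
    host_state: str  # host state, e.g. "HEARTBEAT_LOST"


def start_blockers(host_components):
    """Return human-readable reasons why START all should not be issued."""
    reasons = []
    # Failed master components block the start.
    for hc in host_components:
        if hc.component in MASTER_COMPONENTS and hc.state == "INSTALL_FAILED":
            reasons.append(f"{hc.component} on {hc.host} is INSTALL_FAILED")
    # Report each unreachable host once so the user can see why.
    lost_hosts = sorted(
        {hc.host for hc in host_components if hc.host_state == "HEARTBEAT_LOST"}
    )
    for host in lost_hosts:
        reasons.append(f"{host} is in HEARTBEAT_LOST state")
    return reasons


components = [
    HostComponent("ZOOKEEPER_SERVER", "c6401.ambari.apache.org",
                  "INSTALL_FAILED", "HEARTBEAT_LOST"),
    HostComponent("HBASE_MASTER", "c6401.ambari.apache.org",
                  "INSTALL_FAILED", "HEARTBEAT_LOST"),
]

blockers = start_blockers(components)
if blockers:
    print("Not issuing START all:")
    for reason in blockers:
        print("  " + reason)
```

An empty list means nothing blocks the start. A non-empty list gives the wizard something to show the user instead of sending a START request that the server will reject.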

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>                 Key: AMBARI-4530
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4530
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Screen Shot 
> 2014-02-18 at 3.20.01 PM.png, Screen Shot 2014-02-18 at 3.31.01 PM.png, 
> Screen Shot 2014-02-18 at 3.31.16 PM.png
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
