[ 
https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909181#comment-13909181
 ] 

Yusaku Sako commented on AMBARI-4530:
-------------------------------------

+1 for the patch.  Nice work.

> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
>                 Key: AMBARI-4530
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4530
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.4.4
>            Reporter: Jaimin D Jetly
>            Assignee: Jaimin D Jetly
>             Fix For: 1.5.0
>
>         Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Solution-1.png, 
> Solution-2.png, Solution-3.png
>
>
> On a two-host cluster, one of the agents was down.
> The first INSTALL attempt fails because tasks for the down agent time out and
> get aborted.
> When INSTALL is retried, no tasks are created for that host (its agent is
> down, so the host is in the HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649  INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 - 
> Command is not created for servicecomponenthost , clusterName=c1, 
> clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER, 
> hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST, 
> targetNewState=INSTALLED
> {noformat}
> However, some tasks do get created for the other agent, and those succeed. At
> this point, the FE assumes that the install succeeded and issues a START all.
> That results in the state change errors we see in the log.
> _The FE assumption is based on the fact that all tasks that were created succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught 
> AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for 
> servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER, 
> componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org, 
> currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have the FE not
> issue a START if any master components are in the INSTALL_FAILED state. In
> addition, showing that some hosts are in the HEARTBEAT_LOST state would help
> the user debug the situation. Another option is to have the BE somehow
> indicate that tasks did not get created for some host(s). In any case, when a
> host is down, we need a way to get out of the install wizard.
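
The first proposal in the description (do not issue START while a master component is in INSTALL_FAILED, and surface hosts in HEARTBEAT_LOST) can be sketched roughly as below. This is only an illustration of the check, not the attached patch and not Ambari code: the HostComponent record, the start_blockers function, the HEALTHY host state, the DATANODE component and the second hostname are all hypothetical names introduced for the example. Only INSTALL_FAILED, HEARTBEAT_LOST, INSTALLED, ZOOKEEPER_SERVER and c6401.ambari.apache.org come from the logs above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostComponent:
    """Hypothetical view of one component on one host, as the FE might see it."""
    host: str
    component: str
    state: str        # component state, e.g. "INSTALLED" or "INSTALL_FAILED"
    host_state: str   # host state, e.g. "HEARTBEAT_LOST"; "HEALTHY" is assumed here
    is_master: bool


def start_blockers(host_components):
    """Return a list of reasons why START should not be issued.

    An empty list means no blocker was found by this (simplified) check.
    """
    reasons = []

    # Master components that failed to install block the START request.
    for hc in host_components:
        if hc.is_master and hc.state == "INSTALL_FAILED":
            reasons.append(
                f"{hc.component} on {hc.host} is in INSTALL_FAILED state"
            )

    # Report each host with a lost heartbeat once, so the user can debug it.
    lost_hosts = sorted(
        {hc.host for hc in host_components if hc.host_state == "HEARTBEAT_LOST"}
    )
    for host in lost_hosts:
        reasons.append(f"host {host} is in HEARTBEAT_LOST state")

    return reasons


if __name__ == "__main__":
    components = [
        HostComponent("c6401.ambari.apache.org", "ZOOKEEPER_SERVER",
                      "INSTALL_FAILED", "HEARTBEAT_LOST", True),
        # Hypothetical second host whose tasks succeeded.
        HostComponent("c6402.ambari.apache.org", "DATANODE",
                      "INSTALLED", "HEALTHY", False),
    ]
    for reason in start_blockers(components):
        print("Not starting services:", reason)
```

The check looks at component and host state rather than at task results, because in the scenario described every task that was created did succeed; the failure is only visible in the states of the host that never received tasks.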



