[
https://issues.apache.org/jira/browse/AMBARI-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908940#comment-13908940
]
Jaimin D Jetly commented on AMBARI-4530:
----------------------------------------
*UI Design:*
As a solution, after the install task completes successfully, the FE queries
the host state of every host successfully registered in the cluster. If any
host is in the "HEARTBEAT_LOST" state, the FE declares the cluster to be in
"INSTALL FAILED" (note: the FE stores this cluster object in browser
localStorage). This makes the UI behave the same way it does when any install
request task is reported to have failed. Importantly, all links to the previous
steps will be enabled and the Next button on the page will be disabled, so the
user cannot complete the installer wizard but can, if desired, go back and
remove a host from the cluster.
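The check described above can be sketched roughly as follows. This is only an illustration, not the code in the attached patches: the type names, the function name, and the "INSTALLED" success status are assumptions, and only "HEARTBEAT_LOST" and "INSTALL FAILED" come from the design itself.

```typescript
// Hypothetical shapes; the real FE models and API payloads may differ.
interface HostInfo {
  hostName: string;
  hostState: string; // e.g. "HEALTHY" or "HEARTBEAT_LOST"
}

interface InstallEvaluation {
  clusterStatus: string;        // "INSTALL FAILED" or (assumed) "INSTALLED"
  lostHosts: string[];          // hosts whose agent heartbeat was lost
  nextEnabled: boolean;         // Next button state on the wizard page
  previousStepsEnabled: boolean; // links back to earlier wizard steps
}

// Run after the install request reports success: any host in
// HEARTBEAT_LOST turns the whole install into a failure for the UI.
function evaluateInstallResult(hosts: HostInfo[]): InstallEvaluation {
  const lostHosts = hosts
    .filter((h) => h.hostState === "HEARTBEAT_LOST")
    .map((h) => h.hostName);
  const failed = lostHosts.length > 0;
  return {
    clusterStatus: failed ? "INSTALL FAILED" : "INSTALLED",
    lostHosts,
    nextEnabled: !failed,
    previousStepsEnabled: failed,
  };
}
```

In this sketch the caller would persist `clusterStatus` with the cluster object in localStorage and use the two flags to drive the wizard navigation.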
In addition to what we see when an install request fails, when a host is
detected to be in the "HEARTBEAT_LOST" state, the UI will display the message
{color:red}Heartbeat lost for the host{color} next to the host. Clicking on the
message will open a host pop-up that displays the error message. Please see
attached snapshot: Solution-3.png
Also, an error message will be shown at the bottom of the page:
{color:red}Ambari agent is not running on <detected number>
hosts.{color}{color:blue} Show Details {color} Please see attached snapshot:
Solution-1.png
Clicking on {color:blue}Show Details{color} opens a pop-up showing a map of
each host to all components on that host. Please see attached snapshot:
Solution-2.png
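As a rough sketch of the data behind the bottom-of-page message and the Show Details pop-up, the FE could group component names by host. The names `HostComponentEntry`, `componentsByLostHost` and `agentDownMessage` are hypothetical, and restricting the map to hosts whose heartbeat was lost is an assumption about what the pop-up lists.

```typescript
// Hypothetical shape for one component assigned to one host.
interface HostComponentEntry {
  hostName: string;
  componentName: string;
}

// Build the host -> components map shown in the "Show Details" pop-up.
// Only hosts passed in lostHosts appear as keys (an assumption).
function componentsByLostHost(
  entries: HostComponentEntry[],
  lostHosts: string[]
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const host of lostHosts) {
    result[host] = [];
  }
  for (const entry of entries) {
    if (Object.prototype.hasOwnProperty.call(result, entry.hostName)) {
      result[entry.hostName].push(entry.componentName);
    }
  }
  return result;
}

// Text of the error message shown at the bottom of the page.
function agentDownMessage(lostHostCount: number): string {
  return `Ambari agent is not running on ${lostHostCount} hosts.`;
}
```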
*Assumption:*
If no host is in the "HEARTBEAT_LOST" state when the install services request
completes successfully, no hostComponent will be in the UNKNOWN or
INSTALL_FAILED state.
> Cluster install errors out strangely without starting services
> --------------------------------------------------------------
>
> Key: AMBARI-4530
> URL: https://issues.apache.org/jira/browse/AMBARI-4530
> Project: Ambari
> Issue Type: Bug
> Components: client
> Affects Versions: 1.4.4
> Reporter: Jaimin D Jetly
> Assignee: Jaimin D Jetly
> Fix For: 1.5.0
>
> Attachments: AMBARI-4530.patch, AMBARI-4530_2.patch, Solution-1.png,
> Solution-2.png, Solution-3.png
>
>
> On a two-host cluster, one of the agents was down.
> First INSTALL attempt fails as tasks for the down agent time out and get
> aborted.
> When INSTALL is retried, there are no tasks created for one host (as agent is
> down and thus host is in HEARTBEAT_LOST state).
> {noformat}
> 06:38:55,649 INFO [qtp593591875-22] AmbariManagementControllerImpl:1147 -
> Command is not created for servicecomponenthost , clusterName=c1,
> clusterId=2, serviceName=HBASE, componentName=HBASE_MASTER,
> hostname=c6401.ambari.apache.org, hostState=HEARTBEAT_LOST,
> targetNewState=INSTALLED
> {noformat}
> However some tasks get created for the other agent and those succeed. At this
> point, FE assumes that install succeeded and then issues a START all. That
> results in state change errors we see in the log.
> _FE assumption is based on the fact that all tasks created succeeded._
> {noformat}
> 06:40:04,488 ERROR [qtp593591875-19] AbstractResourceProvider:302 - Caught
> AmbariException when modifying a resource
> org.apache.ambari.server.AmbariException: Invalid transition for
> servicecomponenthost, clusterName=c1, clusterId=2, serviceName=ZOOKEEPER,
> componentName=ZOOKEEPER_SERVER, hostname=c6401.ambari.apache.org,
> currentState=INSTALL_FAILED, newDesiredState=STARTED
> {noformat}
> We should discuss possible solutions. One solution could be to have the FE
> not issue a START if any master components are in the INSTALL_FAILED state.
> In addition, showing that some hosts are in the HEARTBEAT_LOST state can help
> the user debug the situation. Another option is to have the BE somehow
> indicate that tasks did not get created for some host(s). In any case, when a
> host is down, we need a way to get out of the install wizard.
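The first option in the description (not issuing START while a master component is in INSTALL_FAILED) might look something like the guard below. It is a sketch under assumptions: the `ComponentStatus` shape, the `isMaster` flag and the function name are hypothetical, and the caller is assumed to have already fetched the component states.

```typescript
// Hypothetical shape for the state of one component on one host.
interface ComponentStatus {
  componentName: string;
  hostName: string;
  state: string;     // e.g. "INSTALLED" or "INSTALL_FAILED"
  isMaster: boolean; // assumed to be known to the FE
}

// Returns false if any master component failed to install, in which
// case the FE should skip the START all request.
function canIssueStart(components: ComponentStatus[]): boolean {
  return !components.some(
    (c) => c.isMaster && c.state === "INSTALL_FAILED"
  );
}
```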