-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37739/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez and Nate Cole.


Bugs: AMBARI-12867
    https://issues.apache.org/jira/browse/AMBARI-12867


Repository: ambari


Description
-------

On a 1000-node RU cluster, I had 2.3.0.0-2557 installed, with about 20 hosts 
down because their heartbeats were lost. I then registered 2.3.2.0-2664, but 
every attempt to install it was aborted, with no logs on the server or agents. 

It turns out that an install is always split into stages of 100 hosts each. 
If any host in a stage fails or times out, the remaining stages are aborted. 
In this case, one host in the first stage timed out, which caused that stage 
and all of the other stages to be aborted.

As a result, a version cannot be installed unless every host is alive. The 
only workaround seems to be deleting the lost hosts from Ambari.

The solution is to use the stage's success criteria to determine if the other 
stages in the request should be aborted.
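
As a rough illustration of that idea (not the code in this patch), the check 
could look like the sketch below. The class name StageAbortSketch, the 
HostResult enum, and the 0.85 success factor are all assumptions made up for 
this example; they are not taken from the Ambari code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: decide whether to abort later stages based on a
// per-stage success factor instead of on any single host failure.
public class StageAbortSketch {

    /** Hypothetical outcome of the install task on one host in a stage. */
    enum HostResult { COMPLETED, FAILED, TIMEDOUT }

    /**
     * Returns true when the fraction of hosts that completed is at least
     * successFactor (for example, 0.85 means 85% of hosts must succeed).
     */
    static boolean stageSucceeded(List<HostResult> results, double successFactor) {
        if (results.isEmpty()) {
            return true; // no hosts in the stage, so nothing failed
        }
        long completed = results.stream()
                .filter(r -> r == HostResult.COMPLETED)
                .count();
        return (double) completed / results.size() >= successFactor;
    }

    /** Later stages are aborted only when the stage misses its criteria. */
    static boolean shouldAbortRemainingStages(List<HostResult> results, double successFactor) {
        return !stageSucceeded(results, successFactor);
    }

    public static void main(String[] args) {
        // One stage of 100 hosts in which a single host timed out.
        List<HostResult> stage =
                new ArrayList<>(Collections.nCopies(99, HostResult.COMPLETED));
        stage.add(HostResult.TIMEDOUT);

        // Assumed success factor of 0.85: the stage passes, later stages run.
        System.out.println(shouldAbortRemainingStages(stage, 0.85));
        // A success factor of 1.0 reproduces the old all-or-nothing behavior.
        System.out.println(shouldAbortRemainingStages(stage, 1.0));
    }
}
```

With a success factor below 1.0, a single timed-out host out of 100 no longer 
aborts the rest of the request; a factor of 1.0 behaves like the current 
all-or-nothing check.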


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/Role.java 636df3f 
  ambari-server/src/main/java/org/apache/ambari/server/controller/internal/ClusterStackVersionResourceProvider.java 6133885 
  ambari-server/src/test/java/org/apache/ambari/server/controller/internal/ClusterStackVersionResourceProviderTest.java a56823b 

Diff: https://reviews.apache.org/r/37739/diff/


Testing
-------

mvn clean test


Thanks,

Jonathan Hurley
