-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
-----------------------------------------------------------

(Updated March 21, 2016, 7:03 p.m.)


Review request for Ambari, Jonathan Hurley and Nate Cole.


Bugs: AMBARI-15446
    https://issues.apache.org/jira/browse/AMBARI-15446


Repository: ambari


Description
-------

When a failure occurs during RU/EU (Rolling Upgrade/Express Upgrade) and the 
task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, we want Ambari to 
automatically retry for up to x minutes. 
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g., 
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increments attempt_count whenever a task is retried, 
but a retry requires resetting start_time to -1. Because of this, we cannot 
rely on the start_time property to know when to time out after several retries.
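
As a rough illustration of how the new parameter might be read, here is a 
minimal sketch. Only the property name comes from the description above. The 
class and method names are hypothetical, and treating a missing or invalid 
value as "retry disabled" (0) is an assumption, not necessarily what the patch 
does.

```java
import java.util.Properties;

// Hypothetical helper; not the actual Configuration.java change in this review.
class RetryTimeoutConfig {
    // Property name as given in the review description.
    static final String KEY = "stack-upgrade.max_retry_timeout_mins";

    /**
     * Returns the retry window in minutes.
     * Assumption: an absent, blank, negative, or non-numeric value means
     * automatic retry is disabled, represented here as 0.
     */
    static int maxRetryTimeoutMins(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return 0;
        }
        try {
            return Math.max(Integer.parseInt(raw.trim()), 0);
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```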

For the implementation, we will add another thread to Ambari that monitors 
failed tasks only during an active RU/EU and changes their status back to 
PENDING so that Action Scheduler can reschedule them.
Fortunately, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are 
blocking, so no other stages are allowed to proceed.
To know when a task was first started, we will add a new property called 
original_start_time to the host_role_command table.
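
The monitoring step described above could be sketched roughly as follows. This 
is not the code in RetryUpgradeActionService.java. The Task class, the Status 
enum, and the method names are hypothetical stand-ins, and the sketch assumes 
times are epoch milliseconds and that original_start_time is never reset 
between attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the retry decision.
class RetryUpgradeSketch {
    enum Status { PENDING, IN_PROGRESS, COMPLETED, HOLDING_FAILED, HOLDING_TIMEDOUT }

    // Stand-in for a host_role_command row.
    static class Task {
        final long id;
        Status status;
        long startTime;               // reset to -1 on every retry
        final long originalStartTime; // set once, when the task first starts

        Task(long id, Status status, long startTime, long originalStartTime) {
            this.id = id;
            this.status = status;
            this.startTime = startTime;
            this.originalStartTime = originalStartTime;
        }
    }

    /** True if the task is in a holding-failure state and still inside the retry window. */
    static boolean shouldRetry(Task t, long nowMs, int maxRetryMins) {
        if (maxRetryMins <= 0) {
            return false; // assumption: 0 means the feature is disabled
        }
        if (t.status != Status.HOLDING_FAILED && t.status != Status.HOLDING_TIMEDOUT) {
            return false;
        }
        if (t.originalStartTime <= 0) {
            return false; // task never started, nothing to measure against
        }
        long deadline = t.originalStartTime + TimeUnit.MINUTES.toMillis(maxRetryMins);
        return nowMs < deadline;
    }

    /** Moves eligible tasks back to PENDING and returns their ids. */
    static List<Long> retryEligible(List<Task> tasks, long nowMs, int maxRetryMins) {
        List<Long> retried = new ArrayList<>();
        for (Task t : tasks) {
            if (shouldRetry(t, nowMs, maxRetryMins)) {
                t.status = Status.PENDING; // lets the scheduler pick it up again
                t.startTime = -1L;         // per-attempt start time is reset
                retried.add(t.id);
            }
        }
        return retried;
    }
}
```

Because the deadline is computed from original_start_time rather than 
start_time, the window keeps shrinking across retries even though start_time 
is reset each time.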

For the agents, we need to ensure that they always write out a response. On its 
first heartbeat, an agent should send the status of its last command so that we 
know it failed and Ambari can retry.


Diffs (updated)
-----

  
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/


Testing
-------

Verified on a live cluster.
New unit test passed; waiting for the full set of unit test results.


Thanks,

Alejandro Fernandez
