-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/#review123927
-----------------------------------------------------------




ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (lines 255 - 262)
<https://reviews.apache.org/r/44926/#comment186257>

    Can be only one, even if it's a downgrade.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 277)
<https://reviews.apache.org/r/44926/#comment186258>

    I think you know why this and others like it can't go into a commit :)


- Nate Cole


On March 16, 2016, 4:41 p.m., Alejandro Fernandez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44926/
> -----------------------------------------------------------
> 
> (Updated March 16, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Bugs: AMBARI-15446
>     https://issues.apache.org/jira/browse/AMBARI-15446
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED 
> or HOLDING_TIMEDOUT, we want Ambari to retry it automatically for up to x 
> minutes. This is useful when a host goes down while Ambari is running a task 
> on it.
> ambari.properties will have one new parameter, e.g., 
> stack-upgrade.max_retry_timeout_mins=15 (absent by default).
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increases the attempt_count whenever a task is 
> retried, but it requires resetting the start_time to -1. Because of this, we 
> cannot rely on the start_time property to know when to time out after several 
> retries.
> 
> For the implementation, we will add another thread to Ambari that monitors 
> failed tasks only during an active RU/EU and changes their status back to 
> PENDING so that Action Scheduler can reschedule them.
> Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are 
> blocking, so no other stages are allowed to proceed.
> To know when a task was first started, we will add a new property called 
> original_start_time to the host_role_command table.
> 
> For the agents, we need to ensure that they always write out a response. On 
> its first heartbeat, each agent should send the status of its last command so 
> we know it failed and Ambari can retry.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java 3a80803 
>   ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/checks/PreviousUpgradeCompleted.java 3a4467f 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/ClusterVersionDAO.java 1bcca60 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/ClusterVersionEntity.java f1867b4 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java ed3c772 
>   ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java 1c7ff61 
>   ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf 
> 
> Diff: https://reviews.apache.org/r/44926/diff/
> 
> 
> Testing
> -------
> 
> Verified on a live cluster.
> 
> TODO: Still need to make more changes to the implementation, add the config, 
> switch to a Guava service, add a column, and add unit tests.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>
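
The retry window described in the request could be sketched roughly as below.
This is not the code from RetryActionMonitor.java in the diff. The class name
RetryWindowSketch, the helpers maxRetryMillis and shouldRetry, and the local
Status enum are hypothetical; only the property name
stack-upgrade.max_retry_timeout_mins, the status names, and
original_start_time come from the description.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not part of the Ambari code base.
class RetryWindowSketch {

    // Stand-in for the task statuses named in the description.
    enum Status { PENDING, HOLDING_FAILED, HOLDING_TIMEDOUT, COMPLETED }

    // Reads the retry window from ambari.properties-style settings.
    // Returns 0 when the property is absent, which disables retries.
    static long maxRetryMillis(Properties props) {
        String raw = props.getProperty("stack-upgrade.max_retry_timeout_mins");
        if (raw == null) {
            return 0L;
        }
        return TimeUnit.MINUTES.toMillis(Long.parseLong(raw.trim()));
    }

    // A held task is eligible for retry only while the time elapsed since
    // original_start_time (not start_time, which is reset to -1 on each
    // retry) is still inside the configured window.
    static boolean shouldRetry(Status status, long originalStartTime,
                               long now, long maxRetryMillis) {
        if (maxRetryMillis <= 0 || originalStartTime < 0) {
            return false;
        }
        boolean held = status == Status.HOLDING_FAILED
                || status == Status.HOLDING_TIMEDOUT;
        return held && (now - originalStartTime) < maxRetryMillis;
    }
}
```

A monitor thread would presumably call shouldRetry for each held task during
an active RU/EU and, when it returns true, set the status back to PENDING so
that Action Scheduler picks the task up again.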
