-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
-----------------------------------------------------------
(Updated March 17, 2016, 11:07 p.m.)
Review request for Ambari, Jonathan Hurley and Nate Cole.
Changes
-------
Added more fixes, still need to write unit tests.
Bugs: AMBARI-15446
https://issues.apache.org/jira/browse/AMBARI-15446
Repository: ambari
Description
-------
When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to retry automatically for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
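As a rough illustration of how the optional parameter could be read, here is a
minimal sketch. The class RetryTimeoutConfig and its method are hypothetical
and are not the Configuration.java change in this review; treating an absent
or invalid value as 0 ("auto-retry disabled") is also an assumption.

```java
import java.util.Properties;

/** Hypothetical helper; not the actual Configuration.java change. */
final class RetryTimeoutConfig {
    // Property name taken from the description above.
    static final String KEY = "stack-upgrade.max_retry_timeout_mins";

    /**
     * Returns the retry window in minutes, or 0 when the property is
     * absent, blank, non-numeric, or negative. Using 0 to mean
     * "auto-retry disabled" is an assumption of this sketch.
     */
    static int maxRetryTimeoutMins(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return 0; // property not present: feature stays off
        }
        try {
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 0; // malformed value: fall back to disabled
        }
    }
}
```

With the example line above loaded into a Properties object, the helper
returns 15; with the property missing, it returns 0.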
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increments the attempt_count whenever a task is retried,
but retrying requires resetting the start_time to -1. Because of this, we cannot
rely on the start_time property to know when to time out after several retries.
For the implementation, we will add another thread to Ambari that monitors
failed tasks, only during an active RU/EU, and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking,
so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new property to
the host_role_command table called original_start_time.
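The retry decision described above can be sketched as follows. This is only an
illustration under stated assumptions: the class RetryDecision, its trimmed
Status enum (a stand-in for Ambari's real status type), and both method names
are hypothetical, and times are assumed to be epoch milliseconds with values
<= 0 meaning "never started".

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the RetryActionMonitor in this review. */
final class RetryDecision {
    // Trimmed stand-in for the real task status enum.
    enum Status { PENDING, IN_PROGRESS, COMPLETED, HOLDING_FAILED, HOLDING_TIMEDOUT }

    /**
     * Value to persist as original_start_time: set on the first start and
     * left untouched on retries, even though start_time is reset to -1.
     */
    static long originalStartTime(long currentOriginalStartMs, long nowMs) {
        return currentOriginalStartMs > 0 ? currentOriginalStartMs : nowMs;
    }

    /**
     * True when a held task is still inside the retry window and may be
     * moved back to PENDING for Action Scheduler to pick up again.
     */
    static boolean shouldRetry(Status status, long originalStartTimeMs,
                               long nowMs, int maxRetryTimeoutMins) {
        if (maxRetryTimeoutMins <= 0) {
            return false; // feature disabled
        }
        if (status != Status.HOLDING_FAILED && status != Status.HOLDING_TIMEDOUT) {
            return false; // only held failures are candidates
        }
        if (originalStartTimeMs <= 0) {
            return false; // task never started, nothing to measure against
        }
        long elapsedMs = nowMs - originalStartTimeMs;
        return elapsedMs < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
    }
}
```

Because the window is measured from original_start_time, which is persisted,
the same check would give the same answer after an Ambari Server restart.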
For the agents, we need to ensure that they always write out a response. On its
first heartbeat, an agent should send the status of its last command so that
Ambari knows the command failed and can retry it.
Diffs (updated)
-----
ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java
429f573
ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java
2764b3f
ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java
3a80803
ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java
a1a686a
ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
PRE-CREATION
ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java
9404506
ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java
f5b1cb4
ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java
19f0602
ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java
9eb514a
ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java
82edbcf
ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java
a803f73
ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql 9b4810c
ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql cc3d197
ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql 07c786d
ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql
ab6dc93
ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql 8e91fde
ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 440ca44
Diff: https://reviews.apache.org/r/44926/diff/
Testing
-------
Verified on a live cluster.
TODO: Still need to make more changes to the implementation, add the config,
switch to a Guava service, add a column, and add unit tests.
Thanks,
Alejandro Fernandez