Alejandro Fernandez created AMBARI-15446:
--------------------------------------------
Summary: Auto-retry on failure during RU/EU
Key: AMBARI-15446
URL: https://issues.apache.org/jira/browse/AMBARI-15446
Project: Ambari
Issue Type: Story
Components: ambari-server
Affects Versions: 2.4.0
Reporter: Alejandro Fernandez
Assignee: Alejandro Fernandez
Fix For: 2.4.0
When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, Ambari should automatically retry it for up to x minutes.
This is useful when a host goes down as Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.:
stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
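As a rough illustration of the "absent by default" behavior, the sketch below reads the parameter from the text of a properties file and returns None when the key is missing, which a caller could treat as "auto-retry disabled". The function name and the None convention are assumptions made for this example, not part of Ambari.

```python
from typing import Optional

# Key name taken from the description above.
PROPERTY_KEY = "stack-upgrade.max_retry_timeout_mins"


def read_max_retry_timeout_mins(properties_text: str) -> Optional[int]:
    """Return the configured timeout in minutes, or None if the key is absent.

    Hypothetical helper: treating an absent key as "feature disabled" is an
    assumption based on the parameter not being present by default.
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == PROPERTY_KEY:
            return int(value.strip())
    return None
```

For example, a file containing the line above would yield 15, while a file without the key would yield None.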
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is retried,
but it requires resetting the start_time to -1. Because of this, we cannot rely
on the start_time property to know when to timeout after several retries.
For the implementation, we will add another thread to Ambari that monitors
failed tasks, only during an active RU/EU, and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are blocking,
so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new property to
the host_role_command table called original_start_time.
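The retry decision described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Task class, the maybe_retry function, and the use of millisecond timestamps are hypothetical and do not reflect Ambari's actual classes. The point it shows is that the timeout is measured from original_start_time, which survives retries, rather than from start_time, which is reset to -1.

```python
from dataclasses import dataclass
from typing import Optional

# States that qualify for auto-retry, per the description above.
RETRYABLE_STATES = {"HOLDING_FAILED", "HOLDING_TIMEDOUT"}


@dataclass
class Task:
    """Hypothetical stand-in for a row in host_role_command."""
    status: str
    start_time: int = -1           # assumed ms; reset to -1 on every retry
    original_start_time: int = -1  # assumed ms; set once, on the first start


def maybe_retry(task: Task, max_retry_timeout_mins: Optional[int], now_ms: int) -> bool:
    """Move a failed task back to PENDING if it is still inside the retry window.

    Returns True if the task was reset, False otherwise.
    """
    # Feature is off when the property is absent (assumption).
    if max_retry_timeout_mins is None:
        return False
    if task.status not in RETRYABLE_STATES:
        return False
    # Without a recorded first start there is nothing to measure against.
    if task.original_start_time < 0:
        return False
    elapsed_ms = now_ms - task.original_start_time
    if elapsed_ms > max_retry_timeout_mins * 60 * 1000:
        return False  # window exhausted; leave the task in its HOLDING state
    task.status = "PENDING"
    task.start_time = -1  # the scheduler expects -1 before rescheduling
    return True
```

Because original_start_time is stored with the task, a monitor written this way could pick up where it left off after a server restart, simply by re-evaluating the same condition.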
For the agents, we need to ensure that they always write out a response. On its
first heartbeat, an agent should send the status of its last command so that we
know it failed and Ambari can retry.