[
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lysnichenko updated AMBARI-4323:
---------------------------------------
Description:
Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A
CANCEL_COMMAND carries the identifier (task_id + stage_id) of the exact command
to cancel.
2. On the server side, commands of this type are issued when tasks are
considered timed out. I'm going to do that in
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right
after arrival (they are not put into the ActionQueue). If the command referenced
by a CANCEL_COMMAND is not present in the ActionQueue (it is already in progress
or completed), the CANCEL_COMMAND is silently ignored.
4. Also, the agent clears the entire ActionQueue when it cannot continue
exchanging heartbeats with the server (disconnect or registration requested).
I'm going to add the appropriate logic to
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The
motivation is to make recovery from a network/server failure faster and more
reliable (the agent will have an empty ActionQueue and can start executing new
EXECUTION_COMMANDs and STATUS_COMMANDs right after registration).
5. In both cases described above (executing a single CANCEL_COMMAND or clearing
the entire ActionQueue), EXECUTION_COMMANDs are treated as transaction-like:
EXECUTION_COMMANDs that are already IN_PROGRESS are never interrupted. This
decreases the chances of leaving the system in a misconfigured/unpredictable
state.
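
The agent-side behaviour in points 3-5 could look roughly like the sketch
below. It is only an illustration under stated assumptions: ActionQueueSketch,
handle_server_command and the dictionary keys command_type, task_id and
stage_id are hypothetical names, not the actual Ambari agent classes or wire
format.

```python
import threading
from collections import deque


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (not the Ambari class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()   # commands that have not started yet
        self.in_progress = None   # command currently executing, if any

    def put(self, command):
        with self._lock:
            self._pending.append(command)

    def start_next(self):
        """Mark the next pending command as IN_PROGRESS and return it."""
        with self._lock:
            self.in_progress = self._pending.popleft() if self._pending else None
            return self.in_progress

    def cancel(self, task_id, stage_id):
        """Remove a pending command; return False if it is not queued."""
        with self._lock:
            for command in self._pending:
                if (command["task_id"], command["stage_id"]) == (task_id, stage_id):
                    self._pending.remove(command)
                    return True
            # Already in progress or completed: the cancel is silently ignored.
            return False

    def clear(self):
        """Drop every pending command; the in-progress one is left alone."""
        with self._lock:
            dropped = len(self._pending)
            self._pending.clear()
            return dropped


def handle_server_command(queue, command):
    """Cancels are handled on arrival and never enter the queue themselves."""
    if command["command_type"] == "CANCEL_COMMAND":
        return queue.cancel(command["task_id"], command["stage_id"])
    queue.put(command)
    return True
```

In this sketch, clear() is what the registration/disconnect path would call,
and neither cancel() nor clear() ever touches the command that is already
running.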
Also, I'm going to fix a bug in
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
There, we pass the stage timeout instead of the task timeout as a parameter of
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded.
After the bugfix, the task timeout plus a small margin will be passed as the
parameter value. The additional margin (10-30 seconds) is needed to avoid
sending a CANCEL_COMMAND when it is not absolutely necessary (in most cases the
task will time out at the agent on its own).
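
The intended check (task timeout plus a small margin) can be illustrated with
the sketch below. It is written in Python for brevity even though the server
code is Java; cancel_needed, its parameters and the 20-second margin are
assumptions for illustration, not the signature of timeOutActionNeeded.

```python
import time

# Assumed margin, picked from the 10-30 second range mentioned above.
CANCEL_GRACE_SECONDS = 20


def cancel_needed(task_start_time, task_timeout, now=None,
                  grace=CANCEL_GRACE_SECONDS):
    """Return True once the task has outlived its own timeout plus the margin.

    All values are in seconds. The margin gives the agent a chance to time
    the task out by itself before the server issues a CANCEL_COMMAND.
    """
    if now is None:
        now = time.time()
    return now > task_start_time + task_timeout + grace
```

With a 600-second task timeout and the assumed 20-second margin, a task started
at t=0 is not considered for cancellation at t=610, but is at t=621.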
was:
- add new command type CLEAR_QUEUE both on agent and server
- server sends this command:
1) when some command times out
2) after agent registration (to cancel any commands from the previous session)
- add new possible command report state ABORTED to the server
- when the agent receives a CLEAR_QUEUE command, it immediately removes all
scheduled commands from the queue and, for every removed command, sends a
command report with the ABORTED state.
Some notes on implementation:
Cancelling commands that have not been executed yet requires relatively little
effort. Cancelling a command in progress is possible (we may invoke the
kill-on-timeout callback method manually or via an event, effectively killing
all subprocesses), but may occasionally leave the system configuration in a
broken state.
Or, instead of clearing the queue, we should probably send something like KILL
TASK to kill specific tasks or remove them from the queue. If we choose to do
so, we may also manage task timeouts only on the server.
> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
> Key: AMBARI-4323
> URL: https://issues.apache.org/jira/browse/AMBARI-4323
> Project: Ambari
> Issue Type: Improvement
> Components: agent, controller
> Affects Versions: 1.5.0
> Reporter: Dmitry Lysnichenko
> Assignee: Dmitry Lysnichenko
> Fix For: 1.5.0