[ 
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-4323:
---------------------------------------

    Description: 
Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A 
CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
command to cancel.
2. On the server side, commands of this type are issued when tasks are 
considered timed out. I'm going to do that here: 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into the ActionQueue). If the command named by 
a CANCEL_COMMAND is not present in the ActionQueue (it is already in progress 
or completed), the CANCEL_COMMAND is silently ignored.
4. Also, the agent clears the entire ActionQueue when it cannot continue 
exchanging heartbeats with the server (disconnect or registration requested). 
I'm going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and can start executing new 
EXECUTION_COMMANDs and STATUS_COMMANDs right after registration).
5. In both cases described above (executing a single CANCEL_COMMAND or clearing 
the entire ActionQueue), EXECUTION_COMMANDs are treated as transaction-like: 
EXECUTION_COMMANDs that are already IN_PROGRESS are never interrupted. This 
decreases the chances of leaving the system in a misconfigured/unpredictable 
state.
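
To make the agent-side behaviour in items 3-5 concrete, here is a minimal 
Python sketch. It is an illustration only, not the actual ambari_agent code: 
the class ActionQueueSketch, its method names, and the dict-based command 
shape are all assumptions made for this example. The only things taken from 
the proposal are that a command is identified by task_id + stage_id, that 
cancelling a command which is no longer queued is silently ignored, and that 
an IN_PROGRESS command is never interrupted.

```python
import threading
from collections import deque


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (not the real class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()   # commands that have not started yet
        self.in_progress = None   # command currently executing, if any

    def put(self, command):
        with self._lock:
            self._pending.append(command)

    def start_next(self):
        """Move the oldest pending command to IN_PROGRESS and return it."""
        with self._lock:
            self.in_progress = self._pending.popleft() if self._pending else None
            return self.in_progress

    def cancel(self, task_id, stage_id):
        """Handle one CANCEL_COMMAND; return True if a queued command was removed."""
        with self._lock:
            for command in self._pending:
                if (command["task_id"], command["stage_id"]) == (task_id, stage_id):
                    self._pending.remove(command)
                    return True
            # Already IN_PROGRESS, completed, or unknown: silently ignore.
            return False

    def clear(self):
        """Drop every queued command; the IN_PROGRESS one is left alone."""
        with self._lock:
            dropped = list(self._pending)
            self._pending.clear()
            return dropped

    def pending_ids(self):
        with self._lock:
            return [c["task_id"] for c in self._pending]
```

In this sketch, cancel() would be called straight from the heartbeat-response 
handling (the CANCEL_COMMAND itself is never queued), and clear() would be 
called when the agent loses contact with the server or is asked to 
re-register.
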
Also, I'm going to fix a bug in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
There, we pass the stage timeout instead of the task timeout as a parameter of 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded. 
After the bugfix, the task timeout plus a small margin will be passed as the 
parameter value. The additional margin (10-30 seconds) is needed to avoid 
sending a CANCEL_COMMAND unless absolutely necessary (in most cases the task 
will time out at the agent on its own).
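
The timing rule above can be sketched as follows. This is a Python 
illustration of the idea only; the real check lives in the Java 
ActionScheduler and its signature is not shown in this issue. The function 
name cancel_needed, its parameters, and the 20-second default margin are 
assumptions picked for the example.

```python
def cancel_needed(task_start, task_timeout, now, grace=20.0):
    """Return True once the task has outlived its own timeout plus a margin.

    All times are in seconds. `grace` models the 10-30 second margin from the
    proposal; 20.0 is an arbitrary default chosen for this sketch.
    """
    return now > task_start + task_timeout + grace
```

With a 600-second task timeout, a task started at t=0 would not trigger a 
CANCEL_COMMAND at t=610 (the agent is expected to time it out by itself), but 
would at t=621.
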

  was:
- add new command type CLEAR_QUEUE both on agent and server 
- server sends this command:
1)  when some command times out
2)  after agent registration (to cancel any commands from the previous session)
- add new possible command report state ABORTED to the server
- when the agent receives a CLEAR_QUEUE command, it immediately removes all 
scheduled commands from the queue and, for every command that was in the 
queue, sends a command report with the ABORTED state.

Some notes on implementation:
Cancelling commands that are not executed yet requires relatively small effort.

Cancelling a command in progress is possible (we may invoke the kill-on-timeout 
callback method manually or via an event, effectively killing all 
subprocesses), but may occasionally leave the system configuration in a broken 
state.

Or, instead of clearing the queue, we should probably send a KILL TASK command 
or something similar to kill specific tasks or remove them from the queue. If 
we choose to do so, we may also manage task timeouts only at the server.


> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
>                 Key: AMBARI-4323
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4323
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
