[ 
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-4323:
---------------------------------------

    Description: 
h2. Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. 
CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact command 
to cancel and an arbitrary text string (the reason for the cancellation). 
So CANCEL_COMMAND looks like:

{code}
{
  "target_task_id": "4-3",
  "reason": "Aborted by user via API"
}
{code}
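
For illustration only, here is a minimal Python sketch of how such a message 
might be built and checked, assuming it travels as JSON. The helper names 
make_cancel_command and parse_cancel_command are hypothetical and are not part 
of the existing agent or server code.

```python
import json


def make_cancel_command(target_task_id, reason):
    """Build a CANCEL_COMMAND payload with the two fields from the proposal."""
    if not target_task_id:
        raise ValueError("target_task_id is required")
    return {"target_task_id": str(target_task_id), "reason": str(reason)}


def parse_cancel_command(raw):
    """Parse a JSON string and check that both expected fields are present."""
    message = json.loads(raw)
    missing = [key for key in ("target_task_id", "reason") if key not in message]
    if missing:
        raise ValueError("CANCEL_COMMAND is missing fields: " + ", ".join(missing))
    return message


# Round trip: serialize on the (hypothetical) server side, parse on the agent side.
wire = json.dumps(make_cancel_command("4-3", "Aborted by user via API"))
cancel = parse_cancel_command(wire)
```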

2. On the server side, commands of this type are issued automatically when 
tasks are considered timed out. I'm going to do that in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
We will also implement (in a separate jira) the ability to cancel an arbitrary 
command via the server API. A new method, addCancelCommandAction(), in 
org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
will become the endpoint that forms a new CANCEL_COMMAND.

3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into the ActionQueue). If the command referenced 
by the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS 
(that is, it has already completed), the CANCEL_COMMAND is silently ignored. 
After executing a CANCEL_COMMAND, the agent starts executing the next 
EXECUTION_COMMAND from the ActionQueue.
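
The decision logic for one CANCEL_COMMAND can be sketched roughly as follows. 
This is an assumption-laden illustration, not the actual Controller.py code: 
the function name handle_cancel_command, the deque standing in for the 
ActionQueue, and the kill_running callback are all hypothetical. The branch 
that kills a running command reflects item 5 of this proposal.

```python
from collections import deque


def handle_cancel_command(cancel, queued_commands, in_progress_task_id, kill_running):
    """Apply one CANCEL_COMMAND and report which of the three cases applied."""
    target = cancel["target_task_id"]
    # Case 1: the command is still waiting in the queue, so just drop it.
    for command in list(queued_commands):
        if command["taskId"] == target:
            queued_commands.remove(command)
            return "removed_from_queue"
    # Case 2: the command is currently running, so its process is killed.
    if in_progress_task_id == target:
        kill_running(cancel["reason"])
        return "killed_in_progress"
    # Case 3: neither queued nor running (already completed): silently ignore.
    return "ignored"


pending = deque([{"taskId": "4-3"}, {"taskId": "4-4"}])
kill_reasons = []
first = handle_cancel_command(
    {"target_task_id": "4-3", "reason": "Aborted by user via API"},
    pending, "4-2", kill_reasons.append)
second = handle_cancel_command(
    {"target_task_id": "4-2", "reason": "Task timed out"},
    pending, "4-2", kill_reasons.append)
third = handle_cancel_command(
    {"target_task_id": "9-9", "reason": "Unknown task"},
    pending, "4-2", kill_reasons.append)
```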

4. Also, the agent clears the entire action queue on every registration (after 
being disconnected from the server or when re-registration is requested). I'm 
going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and will be able to execute 
new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after registration). 
Currently, after re-registration the agent is locked up and continues to 
execute stale EXECUTION_COMMANDs.
Also, I'll recheck that the server discards tasks for a host when its heartbeat 
is lost.
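
Clearing the queue could look roughly like the sketch below. It assumes the 
pending commands live in a standard-library queue (the Python 3 module name 
queue is used here; under Python 2 the module is called Queue). The function 
name clear_action_queue is hypothetical.

```python
import queue


def clear_action_queue(command_queue):
    """Drain every pending command without blocking; return what was discarded."""
    discarded = []
    while True:
        try:
            discarded.append(command_queue.get_nowait())
        except queue.Empty:
            # Nothing left to drain.
            return discarded


stale = queue.Queue()
stale.put({"taskId": "4-3"})
stale.put({"taskId": "4-4"})
dropped = clear_action_queue(stale)
```

Returning the discarded commands leaves room for logging or reporting them, 
which may or may not be wanted in the real implementation.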

5. When an appropriate CANCEL_COMMAND is received, the EXECUTION_COMMAND is 
cancelled even if it is IN_PROGRESS (the process is killed). 
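
One possible way to kill an in-progress command, assuming it runs as a child 
process started via the subprocess module, is sketched below. The grace period 
before a forced kill is my own assumption and is not part of the proposal; the 
name kill_running_command is hypothetical.

```python
import subprocess
import sys


def kill_running_command(process, grace_seconds=5.0):
    """Ask the process to stop; force-kill it if it does not exit in time."""
    if process.poll() is not None:
        return process.returncode  # already finished, nothing to kill
    process.terminate()
    try:
        return process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()


# A long-running child process stands in for an IN_PROGRESS EXECUTION_COMMAND.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = kill_running_command(child)
```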

6. The agent forms command reports for cancelled commands just as it does for 
COMPLETE and FAILED commands. The status of a cancelled command is set to 
FAILED. I did not find enough justification for adding a new command report 
state CANCELED; feedback is welcome. The reason text (why the command was 
cancelled) is appended to both the command's stderr and its stdout.

So, a cancelled command report looks like:

{code}
{
  "taskId": "4-3",
  "status": "FAILED",
  "stderr": ".... some text ... \n Command was aborted because of: Aborted by user via API ",
  "stdout": ".... some text ... \n Command was aborted because of: Aborted by user via API ",
  "exitcode": 999
}
{code}
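
A report of this shape could be assembled along the following lines. This is 
only a sketch: the function name build_cancelled_report is hypothetical, and 
the exit code 999 is taken from the example above rather than from any 
existing constant.

```python
def build_cancelled_report(task_id, stdout, stderr, reason, exitcode=999):
    """Build a FAILED report with the cancellation reason appended to both streams."""
    note = "\nCommand was aborted because of: " + reason
    return {
        "taskId": task_id,
        "status": "FAILED",  # no separate CANCELED state, per the proposal
        "stderr": stderr + note,
        "stdout": stdout + note,
        "exitcode": exitcode,
    }


report = build_cancelled_report(
    "4-3", "partial output", "partial errors", "Aborted by user via API")
```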

Also, I'm going to fix a naming issue in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
There, we pass a stage timeout as a parameter to 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded, but 
the variable name (taskTimeout) is misleading.

This implementation should also resolve the related jira AMBARI-4324.

  was:
h2. Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. 
CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact command 
to cancel and an arbitrary text string (the reason for the cancellation). 
So CANCEL_COMMAND looks like:

{code}
{
  "target_task_id": "4-3",
  "reason": "Aborted by user via API"
}
{code}

2. On the server side, commands of this type are issued automatically when 
tasks are considered timed out. I'm going to do that in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
We will also implement (in a separate jira) the ability to cancel an arbitrary 
command via the server API. A new method, addCancelCommandAction(), in 
org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
will become the endpoint that forms a new CANCEL_COMMAND.

3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into the ActionQueue). If the command referenced 
by the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS 
(that is, it has already completed), the CANCEL_COMMAND is silently ignored. 
After executing a CANCEL_COMMAND, the agent starts executing the next 
EXECUTION_COMMAND from the ActionQueue.

4. Also, the agent clears the entire action queue on every registration (after 
being disconnected from the server or when re-registration is requested). I'm 
going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and will be able to execute 
new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after registration). 
Currently, after re-registration the agent is locked up and continues to 
execute stale EXECUTION_COMMANDs.

5. -In both cases described above (executing a single CANCEL_COMMAND or 
clearing the entire ActionQueue), EXECUTION_COMMANDs are treated as 
transaction-like: EXECUTION_COMMANDs that are already IN_PROGRESS are never 
interrupted. This decreases the chances of leaving the system in a 
misconfigured/unpredictable state.- When an appropriate CANCEL_COMMAND is 
received, the EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS. 

6. The agent forms command reports for cancelled commands just as it does for 
COMPLETE and FAILED commands. The status of a cancelled command is set to 
FAILED. I did not find enough justification for adding a new command report 
state CANCELED; feedback is welcome. The reason text (why the command was 
cancelled) is appended to both the command's stderr and its stdout.

So, a cancelled command report looks like:

{code}
{
  "taskId": "4-3",
  "status": "FAILED",
  "stderr": ".... some text ... \n Command was aborted because of: Aborted by user via API ",
  "stdout": ".... some text ... \n Command was aborted because of: Aborted by user via API ",
  "exitcode": 999
}
{code}

Also, I'm going to fix a naming issue in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
There, we pass a stage timeout as a parameter to 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded, but 
the variable name (taskTimeout) is misleading.

This implementation should also resolve the related jira AMBARI-4324.


> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
>                 Key: AMBARI-4323
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4323
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>


