[ 
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lysnichenko updated AMBARI-4323:
---------------------------------------

    Description: 
h2. Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. 
A CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
command to cancel and an arbitrary text string (the reason for the 
cancellation). So a CANCEL_COMMAND looks like:

{code}
{
  target_task_id: "4-3",
  reason: "Aborted by user via API"
}
{code}

2. On the server side, commands of this type are issued automatically when 
tasks are considered timed out. I'm going to do that in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
We will also implement (in a separate jira) the ability to cancel an arbitrary 
command via the server API. A new method addCancelCommandAction() in 
org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
will become the endpoint that builds a new CANCEL_COMMAND.

3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into the ActionQueue). If the command named by 
the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS (for 
example, it has already completed), the CANCEL_COMMAND is silently ignored. 
After executing a CANCEL_COMMAND, the agent starts executing the next 
EXECUTION_COMMAND from the ActionQueue.

4. Also, the agent clears the entire action queue on every registration (after 
being disconnected from the server or when re-registration is requested). I'm 
going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and will be able to execute 
new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after registration). 
Currently, after re-registration the agent is locked up and continues to 
execute stale EXECUTION_COMMANDs.

5. -In both cases described above (executing a single CANCEL_COMMAND or 
clearing the entire ActionQueue) EXECUTION_COMMANDs are considered 
transactional-like. I mean that EXECUTION_COMMANDs that are already IN_PROGRESS 
are never interrupted. Thus we decrease the chances of leaving the system in a 
misconfigured/unpredictable state.- When an appropriate CANCEL_COMMAND is 
received, the EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS. 

6. The agent builds command reports for cancelled commands just as it does 
for COMPLETE and FAILED commands. The status of a cancelled command is set to 
FAILED. I did not find sufficient reason to add a new command report state 
CANCELED; feedback is welcome. The reason text (why the command was cancelled) 
is appended to both the command stderr and the command stdout.

So a cancelled command report looks like:

{code}
{
  taskId: "4-3",
  status: FAILED,
  stderr: ".... some text ... \n Command was aborted because of: Aborted by user via API",
  stdout: ".... some text ... \n Command was aborted because of: Aborted by user via API",
  exitcode: 999
}
{code}
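
The agent-side handling described in steps 3, 5 and 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Ambari agent code: the class ActionQueueSketch, its methods, and the "output" key are hypothetical names, while the report fields and the exit code 999 are taken from the example report above. Stopping the process of a command that is IN_PROGRESS is left out.

```python
import threading
from collections import deque

CANCELLED_EXIT_CODE = 999  # value taken from the example report above


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (not the real class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()    # queued EXECUTION_COMMANDs (dicts with "taskId")
        self._in_progress = None   # command currently being executed, if any
        self.reports = []          # command reports for the next heartbeat

    def put(self, command):
        with self._lock:
            self._pending.append(command)

    def start_next(self):
        """Mark the next queued command as IN_PROGRESS and return it."""
        with self._lock:
            self._in_progress = self._pending.popleft() if self._pending else None
            return self._in_progress

    def cancel(self, cancel_command):
        """Handle a CANCEL_COMMAND; return True if a command was cancelled."""
        task_id = cancel_command["target_task_id"]
        reason = cancel_command["reason"]
        with self._lock:
            queued = [c for c in self._pending if c["taskId"] == task_id]
            if queued:
                # Step 3: the command has not started yet, so just drop it.
                self._pending.remove(queued[0])
                output = ""
            elif self._in_progress and self._in_progress["taskId"] == task_id:
                # Step 5: cancelled even though it is IN_PROGRESS.
                # A real agent would also have to stop the running process.
                output = self._in_progress.get("output", "")
                self._in_progress = None
            else:
                # Unknown or already completed: silently ignored.
                return False
            # Step 6: report as FAILED, with the reason appended to both streams.
            text = output + "\nCommand was aborted because of: " + reason
            self.reports.append({
                "taskId": task_id,
                "status": "FAILED",
                "stderr": text,
                "stdout": text,
                "exitcode": CANCELLED_EXIT_CODE,
            })
            return True
```

The boolean return value is only a convenience for the sketch; the proposal itself does not say that the agent signals an ignored CANCEL_COMMAND in any way.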

Also, I'm going to fix a naming issue in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
There, we pass a stage timeout as a parameter to 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded, but 
the variable name (taskTimeout) is misleading.

This implementation should also solve another related jira, AMBARI-4324.
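
Step 4 above (clearing the queue on every registration) might look roughly like the following sketch. It is only an illustration: AgentSketch, its constructor argument, and the method body are hypothetical and do not mirror the real Controller.py, and clearing before (rather than after) the registration call is an assumption, since the proposal does not specify the order.

```python
from collections import deque


class AgentSketch:
    """Hypothetical fragment of an agent controller (not the real Controller.py)."""

    def __init__(self, register):
        self._register = register    # callable that performs the registration
        self.action_queue = deque()  # pending EXECUTION_COMMANDs / STATUS_COMMANDs

    def register_and_heartbeat(self):
        # Drop everything queued before the connection was lost, so that
        # stale commands are not executed after re-registration.
        dropped = len(self.action_queue)
        self.action_queue.clear()
        self._register()
        return dropped  # number of stale commands discarded
```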

  was:
h2. Implementation proposal:
1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. 
A CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
command to cancel and an arbitrary text string (the reason for the 
cancellation). So a CANCEL_COMMAND looks like:

{code}
{
  target_task_id: "4-3",
  reason: "Aborted by user via API"
}
{code}

2. On the server side, commands of this type are issued automatically when 
tasks are considered timed out. I'm going to do that in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
We will also implement (in a separate jira) the ability to cancel an arbitrary 
command via the server API. A new method addCancelCommandAction() in 
org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
will become the endpoint that builds a new CANCEL_COMMAND.

3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into the ActionQueue). If the command named by 
the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS (for 
example, it has already completed), the CANCEL_COMMAND is silently ignored. 
After executing a CANCEL_COMMAND, the agent starts executing the next 
EXECUTION_COMMAND from the ActionQueue.

4. Also, the agent clears the entire action queue on every registration (after 
being disconnected from the server or when re-registration is requested). I'm 
going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and will be able to execute 
new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after registration). 
Currently, after re-registration the agent is locked up and continues to 
execute stale EXECUTION_COMMANDs.

5. -In both cases described above (executing a single CANCEL_COMMAND or 
clearing the entire ActionQueue) EXECUTION_COMMANDs are considered 
transactional-like. I mean that EXECUTION_COMMANDs that are already IN_PROGRESS 
are never interrupted. Thus we decrease the chances of leaving the system in a 
misconfigured/unpredictable state.- When an appropriate CANCEL_COMMAND is 
received, the EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS. 

6. The agent builds command reports for cancelled commands just as it does 
for COMPLETE and FAILED commands. The status of a cancelled command is set to 
FAILED. I did not find sufficient reason to add a new command report state 
CANCELED; feedback is welcome. The reason text (why the command was cancelled) 
is appended to both the command stderr and the command stdout.

So a cancelled command report looks like:

{code}
{
  taskId: "4-3",
  status: FAILED,
  stderr: ".... some text ... \n Command was aborted because of: Aborted by user via API",
  stdout: ".... some text ... \n Command was aborted because of: Aborted by user via API",
  exitcode: 999
}
{code}

Also, I'm going to fix a bug in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage. 
There, we pass a stage timeout instead of a task timeout as a parameter to 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded. 
After the bugfix, the task timeout plus a small margin will be passed as the 
parameter value. The additional margin (10-30 seconds) is needed to avoid 
sending a CANCEL_COMMAND when it is not absolutely necessary (in most cases a 
task times out at the agent on its own, without server action).

This implementation should also solve another related jira, AMBARI-4324.


> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
>                 Key: AMBARI-4323
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4323
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>
> h2. Implementation proposal:
> 1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. 
> A CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
> command to cancel and an arbitrary text string (the reason for the 
> cancellation). So a CANCEL_COMMAND looks like:
> {code}
> {
>   target_task_id: "4-3",
>   reason: "Aborted by user via API"
> }
> {code}
> 2. On the server side, commands of this type are issued automatically when 
> tasks are considered timed out. I'm going to do that in 
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
> We will also implement (in a separate jira) the ability to cancel an 
> arbitrary command via the server API. A new method addCancelCommandAction() in 
> org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
> will become the endpoint that builds a new CANCEL_COMMAND.
> 3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
> after arrival (they are not put into the ActionQueue). If the command named 
> by the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS 
> (for example, it has already completed), the CANCEL_COMMAND is silently 
> ignored. After executing a CANCEL_COMMAND, the agent starts executing the 
> next EXECUTION_COMMAND from the ActionQueue.
> 4. Also, the agent clears the entire action queue on every registration 
> (after being disconnected from the server or when re-registration is 
> requested). I'm going to add the appropriate logic to 
> src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
> motivation is to make recovery from a network/server failure faster and more 
> reliable (the agent will have an empty ActionQueue and will be able to 
> execute new EXECUTION_COMMANDs and STATUS_COMMANDs immediately after 
> registration). Currently, after re-registration the agent is locked up and 
> continues to execute stale EXECUTION_COMMANDs.
> 5. -In both cases described above (executing a single CANCEL_COMMAND or 
> clearing the entire ActionQueue) EXECUTION_COMMANDs are considered 
> transactional-like. I mean that EXECUTION_COMMANDs that are already 
> IN_PROGRESS are never interrupted. Thus we decrease the chances of leaving 
> the system in a misconfigured/unpredictable state.- When an appropriate 
> CANCEL_COMMAND is received, the EXECUTION_COMMAND is cancelled even if it is 
> IN_PROGRESS. 
> 6. The agent builds command reports for cancelled commands just as it does 
> for COMPLETE and FAILED commands. The status of a cancelled command is set to 
> FAILED. I did not find sufficient reason to add a new command report state 
> CANCELED; feedback is welcome. The reason text (why the command was 
> cancelled) is appended to both the command stderr and the command stdout.
> So a cancelled command report looks like:
> {code}
> {
>   taskId: "4-3",
>   status: FAILED,
>   stderr: ".... some text ... \n Command was aborted because of: Aborted by user via API",
>   stdout: ".... some text ... \n Command was aborted because of: Aborted by user via API",
>   exitcode: 999
> }
> {code}
> Also, I'm going to fix a naming issue in 
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
> There, we pass a stage timeout as a parameter to 
> org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded, 
> but the variable name (taskTimeout) is misleading.
> This implementation should also solve another related jira, AMBARI-4324.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
