[jira] [Commented] (AMBARI-4323) Add ability to an agent to clear the ActionQueue

Dmitry Lysnichenko (JIRA) Mon, 24 Feb 2014 07:57:27 -0800

    [ 
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910417#comment-13910417
 ]


Dmitry Lysnichenko commented on AMBARI-4323:
--------------------------------------------

[[email protected]], yes, we perform the same action as for commands that 
timed out: we kill the process and all child subprocesses.
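
For illustration only, here is a minimal POSIX-only sketch of "kill the process and all child subprocesses". It is not the agent's actual code: the helper names start_in_own_group and kill_with_children and the grace period are made up for this example. It assumes the command is started as the leader of its own process group, so that one signal reaches the whole tree.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start argv as the leader of a new session, and so of a new process group."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_with_children(proc, grace_seconds=2.0):
    """Signal the whole process group of proc and return the command's exit code.

    Because the command leads its own group, its group id equals its pid.
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)  # polite request to the whole group first
    except ProcessLookupError:
        return proc.wait()  # the group is already gone
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # hypothetical escalation after the grace period
        return proc.wait()
```

A caller would start the command with start_in_own_group(["sh", "-c", "sleep 30 & sleep 30"]) and later pass the returned object to kill_with_children(); both the shell and the background sleep receive the signal.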

[~sumitmohanty], 
{quote}
I assume when a CANCEL is issued for a task the state of the task in the server 
side is IN_PROGRESS. Only, when the agent returns the command status that the 
server will update the state to FAILED.
{quote}
A CANCEL command may be issued for any task id, whether the task is 
IN_PROGRESS or just scheduled.

{quote}
Are we also implementing canceling of IN_PROGRESS commands? Just wanted to 
confirm.
{quote}
Yes, we may cancel even IN_PROGRESS commands. If a command has already 
finished successfully by the time the agent receives the relevant CANCEL 
command, we display the command as COMPLETED (the cancellation has no effect).
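
To make the three cases concrete (just scheduled, IN_PROGRESS, already finished), here is a toy model of the intended behaviour. It is a sketch, not the agent implementation: the AgentModel class, its method names and the QUEUED state are invented for this example, and the real agent would also kill the running process in the IN_PROGRESS branch.

```python
from collections import deque

# QUEUED is a name made up for "just scheduled"; the other states come from the discussion.
QUEUED, IN_PROGRESS, COMPLETED, FAILED = "QUEUED", "IN_PROGRESS", "COMPLETED", "FAILED"


class AgentModel:
    """Hypothetical agent: a queue of pending task ids plus a status per task."""

    def __init__(self):
        self.queue = deque()
        self.status = {}

    def schedule(self, task_id):
        self.queue.append(task_id)
        self.status[task_id] = QUEUED

    def start_next(self):
        task_id = self.queue.popleft()
        self.status[task_id] = IN_PROGRESS
        return task_id

    def finish(self, task_id):
        self.status[task_id] = COMPLETED

    def cancel(self, task_id):
        """Apply a CANCEL and return the resulting status (None for an unknown task)."""
        state = self.status.get(task_id)
        if state == QUEUED:
            self.queue.remove(task_id)      # never starts
            self.status[task_id] = FAILED
        elif state == IN_PROGRESS:
            self.status[task_id] = FAILED   # a real agent would kill the process here
        # COMPLETED, FAILED or unknown tasks: the cancellation has no effect
        return self.status.get(task_id)
```

In this model, cancelling a task that already finished leaves it COMPLETED, which matches the answer above.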

{quote}
clears entire action queue on every registration: When expected and received 
heartbeat ids are not the same the server asks agents to re-register. I do not 
know if in this case server assumes that in progress commands are aborted. If 
server does not assumes that commands are aborted then agent should not clear 
the action queue. Can you verify the behavior here?
{quote}
According to the code at /org/apache/ambari/server/agent/HeartbeatMonitor.java:144, 
we clear the server-side queue when the heartbeat is lost:
{code}
        //Purge action queue
        actionQueue.dequeueAll(host);
        //notify action manager
        actionManager.handleLostHost(host);
{code}
But I'll recheck that when implementing the jira

{quote}
we pass a stage timeout instead of task timeout as a parameter: That part of 
the ActionScheduler always seems more complex than needed. Anyway, I have a 
question. Today stage timeout is the largest "sum of task timeouts on a single 
host". One a one node, during install the last_attempt_time is basically the 
same for all install tasks. So if we pass "task timeout + some small time" 
while checking timeOutActionNeeded() for the later tasks - ones in the end of 
the queue in Agent - will we not have a problem?
{quote}
Good point. Actually, we are currently passing the stage timeout value, but it 
is named taskTimeout. So the fix is just to rename the variable.
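
A small worked example of why the distinction matters. This is a sketch under assumptions: stage_timeout and time_out_action_needed are Python stand-ins written for this example (not the real ActionScheduler methods), and the 600-second task timeout and 30-second slack are invented numbers.

```python
from collections import defaultdict


def stage_timeout(tasks):
    """Largest sum of task timeouts on a single host.

    tasks is an iterable of (host, task_timeout_seconds) pairs.
    """
    per_host = defaultdict(int)
    for host, timeout in tasks:
        per_host[host] += timeout
    return max(per_host.values(), default=0)


def time_out_action_needed(last_attempt_time, now, timeout):
    """Hypothetical check: has more than `timeout` elapsed since the last attempt?"""
    return now - last_attempt_time > timeout


# Three install tasks on one host, all with the same last_attempt_time of 0.
tasks = [("host1", 600), ("host1", 600), ("host1", 600)]
now = 1000  # the third task is still waiting in the agent queue at this point

with_task_timeout = time_out_action_needed(0, now, 600 + 30)
with_stage_timeout = time_out_action_needed(0, now, stage_timeout(tasks))
```

Here the stage timeout is 1800 seconds. With "task timeout + small time" (630 seconds) the queued third task would already be flagged as timed out at 1000 seconds, while with the stage timeout it is not.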

> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
>                 Key: AMBARI-4323
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4323
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>
> h2. Implementation proposal:
> 1. Add a new command type CANCEL_COMMAND to agent-server protocol. 
> CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
> command to cancel and an arbitrary text string (the reason for the 
> cancellation).  So CANCEL_COMMAND looks like 
> {code}
> {
>   target_task_id: "4-3"
>   reason: "Aborted by user via API"
> }
> {code}
> 2. At the server side, commands of this type are issued automagically when 
> tasks are considered timed out. I'm going to do that here: 
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
>  Also we will implement (in a separate jira) an ability to cancel arbitrary 
> order via server API. A new method addCancelCommandAction() at 
> org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java 
> will become the endpoint that forms up a new CANCEL_COMMAND.
> 3. At the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
> after arrival (they are not put into ActionQueue). If command mentioned by 
> the CANCEL_COMMAND is not present in the ActionQueue (it is already in 
> progress or completed) and command is not IN_PROGRESS, CANCEL_COMMAND is 
> silently ignored. After executing  CANCEL_COMMAND, agent starts executing 
> next EXECUTION_COMMAND from the ActionQueue.
> 4. Also, agent clears entire action queue on every registration (disconnected 
> from the server or the re-registration is requested). I'm going to add an 
> appropriate logic to 
> src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
> motivation for doing that is to make a recovery from the network/server fail 
> more reliable and fast (agent will have an empty ActionQueue and will be able 
> to execute new EXECUTION_COMMANDS and STATUS_COMMANDS immediately after 
> registration). Currently, after re-registration agent is locked up and 
> continues to execute stale EXECUTION_COMMANDS.
> 5. -In both cases described above (executing a single CANCEL_COMMAND or 
> clearing entire ActionQueue) EXECUTION_COMMANDS are considered 
> transactional-like. I mean that EXECUTION_COMMANDs that are already 
> IN_PROGRESS are never interrupted. Thus we decrease chances of leaving system 
> in misconfigured/unpredictable state.- When appropriate CANCEL_COMMAND is 
> received, EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS. 
> 6. Agent forms up command reports for cancelled commands just like it is done 
> for COMPLETE and FAILED commands. Command statuses for cancelled commands are 
> set to FAILED. I did not find enough reasoning for adding a new command 
> report state CANCELED, feedback is welcome. Reasoning text (why command has 
> been cancelled) is appended to command stderr and to command stdout.
> So, cancelled command report looks like:
> {code}
> {
>   taskId: "4-3"
>   status : FAILED
>   stderr : ".... some text ... \n Command was aborted because of: Aborted by 
> user via API "
>   stdout : ".... some text ... \n Command was aborted because of: Aborted by 
> user via API "
>   exitcode: 999
> }
> {code}
> Also, I'm going to fix a bug at 
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage 
> . Here, we pass a stage timeout instead of task timeout as a parameter for 
> org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded . 
> After the bugfix, task timeout + some small time will be passed as the 
> parameter value. The additional small time (10-30 seconds) is needed to avoid 
> sending CANCEL_COMMAND when it is not absolutely necessary (in most cases the 
> task times out at the agent automatically, without any server action).
> This implementation should also solve another related jira AMBARI-4324
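
The cancelled-command report from item 6 of the quoted proposal could be assembled roughly as follows. This is only a sketch: build_cancelled_report is a hypothetical helper, and the field names, the FAILED status and the 999 exit code are copied from the example report in the proposal.

```python
def build_cancelled_report(task_id, stdout, stderr, reason, exitcode=999):
    """Build a report for a cancelled command.

    The status is FAILED and the cancellation reason is appended to both
    stdout and stderr, as described in the proposal.
    """
    note = "\nCommand was aborted because of: " + reason
    return {
        "taskId": task_id,
        "status": "FAILED",
        "stdout": stdout + note,
        "stderr": stderr + note,
        "exitcode": exitcode,
    }
```

For example, build_cancelled_report("4-3", "some output", "some errors", "Aborted by user via API") yields a FAILED report whose stdout and stderr both end with the reason text.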



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
