[
https://issues.apache.org/jira/browse/AMBARI-4323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910365#comment-13910365
]
Nate Cole commented on AMBARI-4323:
-----------------------------------
Is the purpose of the cancel command only to remove commands from the queue? What
about those items that are IN_PROGRESS - how do you actually stop them? For example,
suppose you issue a START command, and the backing OS command (could be, say,
some hdfs command) blocks. Will you issue a kill -9 on the command? Or are we
just removing the "monitoring" for that request/task combination?
> Add ability to an agent to clear the ActionQueue
> ------------------------------------------------
>
> Key: AMBARI-4323
> URL: https://issues.apache.org/jira/browse/AMBARI-4323
> Project: Ambari
> Issue Type: Improvement
> Components: agent, controller
> Affects Versions: 1.5.0
> Reporter: Dmitry Lysnichenko
> Assignee: Dmitry Lysnichenko
> Fix For: 1.5.0
>
>
> h2. Implementation proposal:
> 1. Add a new command type CANCEL_COMMAND to the agent-server protocol.
> CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact command
> to cancel and an arbitrary text string (the reason for the
> cancellation). So CANCEL_COMMAND looks like
> {code}
> {
> "target_task_id": "4-3",
> "reason": "Aborted by user via API"
> }
> {code}
> 2. On the server side, commands of this type are issued automatically when
> tasks are considered timed out. I'm going to do that here:
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
> We will also implement (in a separate jira) the ability to cancel an arbitrary
> command via the server API. A new method addCancelCommandAction() in
> org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java
> will become the endpoint that builds a new CANCEL_COMMAND.
> 3. On the agent side, CANCEL_COMMANDs are executed inside Controller.py right
> after arrival (they are not put into the ActionQueue). If the command named by
> the CANCEL_COMMAND is neither present in the ActionQueue nor IN_PROGRESS
> (that is, it has already completed), the CANCEL_COMMAND is
> silently ignored. After executing a CANCEL_COMMAND, the agent starts executing
> the next EXECUTION_COMMAND from the ActionQueue.
> 4. Also, the agent clears the entire action queue on every registration (after
> a disconnect from the server or when re-registration is requested). I'm going
> to add the appropriate logic to
> src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The
> motivation is to make recovery from a network/server failure
> faster and more reliable (the agent will have an empty ActionQueue and will be able
> to execute new EXECUTION_COMMANDS and STATUS_COMMANDS immediately after
> registration). Currently, after re-registration the agent is locked up and
> continues to execute stale EXECUTION_COMMANDS.
> 5. -In both cases described above (executing a single CANCEL_COMMAND or
> clearing the entire ActionQueue) EXECUTION_COMMANDS are considered
> transactional-like. I mean that EXECUTION_COMMANDs that are already
> IN_PROGRESS are never interrupted. Thus we decrease the chances of leaving the system
> in a misconfigured/unpredictable state.- When an appropriate CANCEL_COMMAND is
> received, the EXECUTION_COMMAND is cancelled even if it is IN_PROGRESS.
> 6. The agent builds command reports for cancelled commands just as it does
> for COMPLETE and FAILED commands. Command statuses for cancelled commands are
> set to FAILED. I did not find enough justification for adding a new command
> report state CANCELED; feedback is welcome. The reason text (why the command has
> been cancelled) is appended to the command's stderr and stdout.
> So, a cancelled command report looks like:
> {code}
> {
> "taskId": "4-3",
> "status": "FAILED",
> "stderr": ".... some text ... \n Command was aborted because of: Aborted by user via API",
> "stdout": ".... some text ... \n Command was aborted because of: Aborted by user via API",
> "exitcode": 999
> }
> {code}
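
To make the agent-side behaviour in items 3-6 concrete, here is a rough Python sketch of how I read it. Everything in it is an assumption for illustration: `ActionQueueSketch` and its methods are hypothetical names, not the agent's real ActionQueue API, and actually stopping a running OS process is left out entirely. It only shows the bookkeeping: a queued or in-progress command is cancelled and reported as FAILED with exit code 999, a completed or unknown one is silently ignored, and `clear()` empties the queue as proposed for registration.

```python
from collections import deque

CANCELLED_EXIT_CODE = 999  # value taken from the example report in the proposal


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue; not the real class."""

    def __init__(self):
        self.pending = deque()   # queued EXECUTION_COMMANDs, oldest first
        self.in_progress = {}    # task id -> output collected so far
        self.reports = []        # command reports waiting for the next heartbeat

    def enqueue(self, command):
        self.pending.append(command)

    def start_next(self):
        """Move the oldest queued command to the in-progress state."""
        command = self.pending.popleft()
        self.in_progress[command["taskId"]] = {"stdout": "", "stderr": ""}
        return command

    def cancel(self, cancel_command):
        """Apply a CANCEL_COMMAND; return True if a command was cancelled."""
        task_id = cancel_command["target_task_id"]
        reason = cancel_command["reason"]
        queued = [c for c in self.pending if c["taskId"] == task_id]
        if queued:
            self.pending.remove(queued[0])
            output = {"stdout": "", "stderr": ""}
        elif task_id in self.in_progress:
            # A real agent would also have to stop the running process here.
            output = self.in_progress.pop(task_id)
        else:
            return False  # already completed or unknown: silently ignored
        note = "\n Command was aborted because of: " + reason
        self.reports.append({
            "taskId": task_id,
            "status": "FAILED",
            "stdout": output["stdout"] + note,
            "stderr": output["stderr"] + note,
            "exitcode": CANCELLED_EXIT_CODE,
        })
        return True

    def clear(self):
        """Drop all queued commands, as proposed for every registration."""
        self.pending.clear()


queue = ActionQueueSketch()
queue.enqueue({"taskId": "4-3"})
queue.enqueue({"taskId": "4-4"})
queue.start_next()  # "4-3" is now IN_PROGRESS, "4-4" is still queued
cancelled = queue.cancel({"target_task_id": "4-3", "reason": "Aborted by user via API"})
ignored = queue.cancel({"target_task_id": "9-9", "reason": "Aborted by user via API"})
```

Note that the sketch does not say how an IN_PROGRESS command is actually interrupted at the OS level; that part still has to be decided separately.
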
> Also, I'm going to fix a bug in
> org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
> There, we pass the stage timeout instead of the task timeout as a parameter to
> org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded.
> After the bugfix, the task timeout plus a small extra interval will be passed as
> the parameter value. The extra interval (10-30 seconds) is needed to avoid
> sending a CANCEL_COMMAND when it is not strictly necessary (in most cases the
> task times out on the agent by itself, without any server action).
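
The timeout check described above could be sketched roughly as follows. This is written in Python only for brevity (the server code is Java), and `cancel_needed` with its parameters is a hypothetical helper, not the signature of the real timeOutActionNeeded method. The 30-second default is an assumption picked from the 10-30 second range mentioned in the description.

```python
GRACE_PERIOD_SECONDS = 30  # assumed; the proposal suggests 10-30 seconds


def cancel_needed(task_start, task_timeout, now, grace=GRACE_PERIOD_SECONDS):
    """Return True once a task has outlived its own timeout plus the grace period.

    All arguments are in seconds. The task timeout is used, not the stage timeout.
    """
    return now - task_start > task_timeout + grace


# A task with a 600 s timeout that started at t=0:
within_grace = cancel_needed(task_start=0, task_timeout=600, now=615)
past_grace = cancel_needed(task_start=0, task_timeout=600, now=631)
```

Inside the grace window the server would send nothing and leave the agent to time the task out on its own; only after the window has passed would a CANCEL_COMMAND be issued.
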
> This implementation should also solve a related jira, AMBARI-4324.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)