[ 
https://issues.apache.org/jira/browse/AMBARI-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874882#comment-13874882
 ] 

Dmitry Lysnichenko commented on AMBARI-4324:
--------------------------------------------

h1. Implementation proposal:

1. Add a new command type, CANCEL_COMMAND, to the agent-server protocol. A 
CANCEL_COMMAND contains the identifier (task_id + stage_id) of the exact 
command to cancel.
2. At the server side, commands of this type are issued when tasks are 
considered timed out. I'm going to do that here: 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage.
3. At the agent side, CANCEL_COMMANDs are executed inside Controller.py right 
after arrival (they are not put into ActionQueue). If the command referenced by 
a CANCEL_COMMAND is not present in ActionQueue (it is already in progress or 
completed), the CANCEL_COMMAND is silently ignored.
4. Also, the agent clears the entire action queue when it cannot continue 
exchanging heartbeats with the server (disconnect or registration requested). 
I'm going to add the appropriate logic to 
src.main.python.ambari_agent.Controller.Controller#registerAndHeartbeat. The 
motivation is to make recovery from a network/server failure faster and more 
reliable (the agent will have an empty ActionQueue and can start executing new 
EXECUTION_COMMANDS and STATUS_COMMANDS right after registration).
5. In both cases described above (executing a single CANCEL_COMMAND or clearing 
the entire ActionQueue), EXECUTION_COMMANDS are treated as transaction-like: 
EXECUTION_COMMANDs that are already IN_PROGRESS are never interrupted. This 
decreases the chances of leaving the system in a misconfigured/unpredictable 
state.
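
A minimal Python sketch of the agent-side behaviour from items 3-5, to make the 
intended semantics concrete. The class name ActionQueueSketch, its methods, and 
the "taskId"/"stageId" dictionary keys are hypothetical and chosen only for 
illustration; this is not the actual ActionQueue code.

```python
import threading
from collections import deque


class ActionQueueSketch:
    """Hypothetical stand-in for the agent's ActionQueue (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()    # commands that have not started yet
        self._in_progress = None   # command currently executing, if any

    def put(self, command):
        with self._lock:
            self._pending.append(command)

    def start_next(self):
        """Mark the next pending command as IN_PROGRESS and return it."""
        with self._lock:
            self._in_progress = self._pending.popleft() if self._pending else None
            return self._in_progress

    def cancel(self, task_id, stage_id):
        """Handle a CANCEL_COMMAND: drop the matching pending command.

        Returns True if a command was removed. A command that is already
        IN_PROGRESS or completed is not in the pending queue, so the
        cancellation is silently ignored and False is returned.
        """
        with self._lock:
            for command in self._pending:
                if command["taskId"] == task_id and command["stageId"] == stage_id:
                    self._pending.remove(command)
                    return True
            return False

    def clear_pending(self):
        """Drop every pending command (disconnect or re-registration).

        The IN_PROGRESS command is left untouched. Returns the number of
        commands that were dropped.
        """
        with self._lock:
            dropped = len(self._pending)
            self._pending.clear()
            return dropped
```

In this sketch the Controller would call cancel() as soon as a CANCEL_COMMAND 
arrives in a heartbeat response, and clear_pending() when heartbeats can no 
longer be exchanged; neither call ever touches the command being executed.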

Also, I'm going to fix a bug in 
org.apache.ambari.server.actionmanager.ActionScheduler#processInProgressStage . 
Here, we pass the stage timeout instead of the task timeout as a parameter of 
org.apache.ambari.server.actionmanager.ActionScheduler#timeOutActionNeeded . 
After the bugfix, the task timeout plus a small grace period will be passed as 
the parameter value. The grace period (10-30 seconds) is needed to avoid 
sending a CANCEL_COMMAND unless absolutely necessary (in most cases the task 
will time out at the agent on its own).
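
The timeout check described above can be sketched as follows. The real check 
lives in the Java ActionScheduler; this Python version only illustrates the 
arithmetic. The function name, its parameters, the use of seconds as the unit, 
and the 20-second grace value are assumptions made for the example.

```python
import time

# Assumed value inside the 10-30 second range mentioned in the proposal.
CANCEL_GRACE_SECONDS = 20


def time_out_action_needed(start_time, task_timeout, now=None,
                           grace=CANCEL_GRACE_SECONDS):
    """Return True once the server should issue a CANCEL_COMMAND.

    All values are in seconds. The deadline is the task timeout plus a
    grace period, so the agent gets the chance to time the task out itself
    before the server steps in.
    """
    if now is None:
        now = time.time()
    return now > start_time + task_timeout + grace
```

With a 600-second task timeout, a task started at t=0 is not cancelled by the 
server at t=610 (still inside the grace period) but is at t=621.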

This implementation should also solve another related jira AMBARI-4324

[~mahadev] and/or [~sumitmohanty], can you please take a look at this proposal?


> Server should rely on command reports when considering tasks timed out
> ----------------------------------------------------------------------
>
>                 Key: AMBARI-4324
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4324
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>
> As of now, the task timeout at the server and the timeout at the agent are 
> two different mechanisms that work independently and duplicate each other. 
> Such behaviour leads to a strange scenario:
> - cluster installation is started
> - execution of some command exceeds timeout
> - server considers this command and *all next* commands in request timed out. 
> This state is shown at UI as well.
> - at the same time, agent considers the currently executing command timed out 
> and kills it. After that, agent starts executing the next command in queue. 
> If the next commands do not fail, agent sends COMPLETE status reports.
> - server receives  COMPLETE status reports and updates component status.
> - if user clicks "Retry installation", only tasks for not installed 
> components are created.
> - as a result, UI shows fewer tasks than the user expects
> Changes in scope of this jira:
> add TIMEDOUT command status report type at agent. At the server side, 
> HostRoleStatus enum already has this status type. Modify server behaviour: 
> server considers a task timed out when it receives appropriate command report 
> from the agent. In this case, all task time tracking logic is consolidated at 
> agent. Doing that will simplify timeout handling for CustomCommands and 
> CustomActions.
> Some issues may occur when the agent host goes down and therefore does not 
> send any command reports. Server should have some handling for such a case.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
