[ 
https://issues.apache.org/jira/browse/AMBARI-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874923#comment-13874923
 ] 

Dmitry Lysnichenko commented on AMBARI-4324:
--------------------------------------------

[~sumitmohanty], 
{quote}
We will also need to support cancel for request or request/task through API. I 
am working on a proposal for that. Can you ensure there are hooks to add Cancel 
commands when it is initiated through an API call.
{quote}
I'll add a relevant hook addCancelCommandAction() to 
/org/apache/ambari/server/controller/AmbariCustomCommandExecutionHelper.java
{quote}
Typically, agents do not have a long queue of pending tasks. Just wondering if 
Controller.py should process the cancel synchronously. Perhaps it is OK.
{quote}
During cluster install, the queue may contain 5-10 pending tasks (all tasks for 
a stage) plus 20 status commands, so it takes a long time to drain. Also, if we 
process the "cancel" request synchronously, it will be placed in the queue after 
the task we are trying to cancel, so it would only run once that task is done.
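
To illustrate the ordering problem, here is a minimal sketch of handling a cancel out-of-band instead of enqueueing it. The CommandQueue class and the "taskId" field are hypothetical names for illustration, not the agent's actual queue implementation.

```python
from collections import deque


class CommandQueue:
    """Hypothetical FIFO of pending agent commands (illustration only)."""

    def __init__(self):
        self._pending = deque()

    def put(self, command):
        # Normal commands are appended and executed in arrival order.
        self._pending.append(command)

    def cancel(self, task_id):
        """Drop a pending task directly, bypassing the FIFO order.

        Returns True if a task with this id was still pending.
        """
        remaining = deque(c for c in self._pending if c["taskId"] != task_id)
        found = len(remaining) != len(self._pending)
        self._pending = remaining
        return found

    def pending_ids(self):
        return [c["taskId"] for c in self._pending]
```

If the cancel were instead appended with put(), it would sit behind the very task it targets and could take effect only after that task had run.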

{quote}
We should have the ability to interrupt INPROGRESS tasks as well. I am thinking 
run-away tasks or misconfigured timeouts. What would it take to have this 
ability?
{quote}
Cancelling a command in progress is possible (we may invoke the kill-on-timeout 
callback method manually, effectively killing all subprocesses), but it may 
occasionally leave the system configuration in a broken state. 
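
As a rough sketch of what "killing all subprocesses" can look like on a POSIX host: the command is started in its own process group, and the whole group is signalled on cancel. The helper names start_command and kill_command and the grace period are assumptions for illustration; this is not the agent's actual kill-on-timeout callback.

```python
import os
import signal
import subprocess


def start_command(argv):
    # start_new_session=True gives the child its own session and process
    # group, so its descendants can be signalled together later.
    return subprocess.Popen(argv, start_new_session=True)


def kill_command(proc, grace_seconds=5.0):
    """Terminate the command and its process group; return the exit code."""
    if proc.poll() is not None:
        return proc.returncode  # already finished, nothing to kill
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)  # polite request first
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # force kill after the grace period
        proc.wait()
    return proc.returncode
```

Nothing here undoes work the command had already done, which is why an interrupted task may leave configuration half-applied.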

{quote}
What is the other JIRA ID? The current link is to this JIRA.

    This implementation should also solve another related jira AMBARI-4324
{quote}
The implementation proposal was intended for AMBARI-4323; I posted it to the 
current (related) jira by mistake. 

> Server should rely on command reports when considering tasks timed out
> ----------------------------------------------------------------------
>
>                 Key: AMBARI-4324
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4324
>             Project: Ambari
>          Issue Type: Improvement
>          Components: agent, controller
>    Affects Versions: 1.5.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.5.0
>
>
> As of now, the task timeout at the server and the timeout at the agent are two 
> different mechanisms that work independently and duplicate each other. 
> Such behaviour leads to a strange scenario:
> - cluster installation is started
> - execution of some command exceeds timeout
> - server considers this command and *all subsequent* commands in the request 
> timed out. This state is shown in the UI as well.
> - at the same time, agent considers the currently executing command timed out 
> and kills it. After that, agent starts executing the next command in the queue. 
> If the next commands do not fail, agent sends COMPLETE status reports.
> - server receives COMPLETE status reports and updates component status.
> - if user clicks "Retry installation", only tasks for not installed 
> components are created.
> - as a result, the UI shows fewer tasks than the user expects
> Changes in scope of this jira:
> Add a TIMEDOUT command status report type at the agent. At the server side, 
> the HostRoleStatus enum already has this status type. Modify server behaviour: 
> the server considers a task timed out when it receives the corresponding 
> command report from the agent. In this case, all task time tracking logic is 
> consolidated at the agent. Doing that will simplify timeout handling for 
> CustomCommands and CustomActions.
> Some issues may occur when the agent host goes down and therefore does not 
> send any command reports. The server should have some handling for such a case.
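
The proposed server behaviour, including a fallback for a silent agent host, could be sketched as follows. This is written in Python for brevity even though the server is Java; TaskTracker, its method names, the report fields, the status strings other than TIMEDOUT, and the 600-second default are all assumptions for illustration.

```python
import time

# Statuses after which the server stops tracking a task (illustrative set).
TERMINAL = {"COMPLETED", "FAILED", "TIMEDOUT"}


class TaskTracker:
    """Hypothetical server-side view of task state driven by agent reports."""

    def __init__(self, heartbeat_lost_after=600.0, clock=time.monotonic):
        self._lost_after = heartbeat_lost_after
        self._clock = clock
        self.status = {}          # task id -> last known status
        self.host_of = {}         # task id -> host running the task
        self.last_heartbeat = {}  # host -> time of last heartbeat

    def add_task(self, task_id, host):
        self.status[task_id] = "QUEUED"
        self.host_of[task_id] = host
        self.last_heartbeat.setdefault(host, self._clock())

    def on_heartbeat(self, host, reports):
        # The agent is the only source of timeout decisions: the server
        # simply records whatever status each command report carries.
        self.last_heartbeat[host] = self._clock()
        for report in reports:
            self.status[report["taskId"]] = report["status"]

    def expire_silent_hosts(self):
        # Fallback: if a host stopped heartbeating, no report will arrive,
        # so its unfinished tasks are marked TIMEDOUT by the server itself.
        now = self._clock()
        for task_id, host in self.host_of.items():
            if self.status[task_id] in TERMINAL:
                continue
            if now - self.last_heartbeat[host] > self._lost_after:
                self.status[task_id] = "TIMEDOUT"
```

With this split, the server keeps no per-task timers; it only needs to notice when a host has gone quiet.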



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
