Hi Rohit, I checked that. Thanks for the details!
-Suresh

On Wed, May 16, 2018 at 4:55 PM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:

> Hi Suresh,
>
> As explained earlier, I advised looking at the code on the PR; perhaps you
> did not get time, so have a look here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L488
>
> reconnect() has historically set the link to null. Therefore, any answer
> from a pending task ends up failing here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L868
>
> and,
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L893
>
> Do note that reconnect() only cancels watch tasks but does not
> cancel/shut down any running task. Also, in case of a network error, the
> mgmt server will fail at the thread/context where it has done an
> agent.send() and is expecting an answer.
>
> You can also perform a small test by adding a while loop or sleep around
> this code to see how getLink().send() behaves when the agent does
> reconnect. When it does not reconnect, i.e. the agent is blocked waiting
> for pending tasks to complete, such tasks always fail.
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
> ________________________________
> From: Suresh Kumar Anaparti <sureshkumar.anapa...@gmail.com>
> Sent: Wednesday, May 16, 2018 4:27:36 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
>
> Hi Rohit,
>
> When the Management Server and Agent are up and running and there is a
> network failure, I think it is better to wait for some time for the
> pending tasks to complete, instead of failing them and trying to
> reconnect. If the network delay is minimal, there can be a valid
> thread/context in the management server to handle the answers.
>
> It would be great if this PR's changes have no major side effects.
>
> Thanks,
> Suresh
>
> On Wed, May 16, 2018 at 3:40 PM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:
>
> > All,
> >
> > Based on testing against KVM, XenServer and VMware and this discussion,
> > I'm merging the PR based on code reviews and tests. I investigated both
> > code-wise and against a live environment for possible side effects of
> > letting the agent connect without being blocked on pending tasks, and I
> > found no new fault behaviour.
> >
> > If there are any objections or bugs, please share, in which case we'll
> > revert the change to continue the legacy/historic behaviour. Thanks.
> >
> > - Rohit
> >
> > ________________________________
> > From: Rohit Yadav <rohit.ya...@shapeblue.com>
> > Sent: Tuesday, May 15, 2018 2:37:58 PM
> > To: dev@cloudstack.apache.org
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
> >
> > Hi Suresh,
> >
> > I've replied to your comment on the PR. In addition, when (i) the
> > management server is restarted, any pending operation on the KVM/SSVM
> > agent side will fail to be communicated back in the correct
> > thread/context, and it depends on the specific feature whether it
> > supports a sync or cleanup mechanism; in most cases, the async job
> > timeout may kick in or cause the queue/concurrency failure seen in logs.
> > When (ii) the agent is reconnected, it reconnects only after any pending
> > job finishes; therefore such jobs finish and fail to be communicated
> > back to the mgmt server (the answer instance fails to be sent on the
> > link, as the link is no longer valid, which causes an exception).
> >
> > - Rohit
> >
> > ________________________________
> > From: Suresh Kumar Anaparti <sureshkumar.anapa...@gmail.com>
> > Sent: Tuesday, May 15, 2018 12:06:14 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
> >
> > Hi,
> >
> > @rhtyd, I checked the PR changes. Good that the agent is not waiting for
> > the pending jobs and retries the connection to the management server.
> > This might have an impact on ssvm and kvm agent tasks, not much on cpvm.
> > Is there any sync or cleanup mechanism for Volumes/VMs to address the
> > failed/pending agent jobs after (i) the management server restarts and
> > (ii) the agent reconnects?
> >
> > -Suresh
> >
> > On Mon, May 14, 2018 at 8:05 PM, Marc-Aurèle Brothier <ma...@exoscale.ch> wrote:
> >
> > > Correct about the thread context, so if the answer is coming into a
> > > management server that doesn't have the context and drops it, it
> > > should be fine then. The PR is then already a good improvement to let
> > > the agent reconnect even when it's doing a long processing request, so
> > > it can keep on completing other jobs too.
> > >
> > > Regarding the restart/shutdown operation, yes, I now have to push the
> > > changes to be able to stop some processing tasks (mainly fetching new
> > > async jobs) on a management server to ensure a cleaner shutdown. My
> > > solution, as said, is based on the content of a file that is
> > > compatible with HAProxy, thus not the LB mechanism added recently in
> > > CS. It could be changed for an API call to put a management server
> > > into, or move it out of, maintenance. The listManagementServers API
> > > call has been merged, and it was a requirement for that.
> > >
> > > About Zookeeper, it's not on the rolling shutdown/restart for now. We
> > > are using it as an efficient and true lock mechanism between multiple
> > > management servers.
> > > We are slowly moving the locks code towards ZK and added one during
> > > the allocation phase to ensure no host would be over-allocated. I will
> > > take this discussion to another email thread, since I have a few
> > > questions regarding ZK and also wish to talk about the connection
> > > between the agent and management servers.
> > >
> > > On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:
> > >
> > > > Thanks Marc and Rafael for replying.
> > > >
> > > > In my experimentation, when the agent disconnects it will wait for
> > > > the pending jobs/tasks to complete, and on completion it creates an
> > > > Answer instance and tries to send it using a `link` which no longer
> > > > exists, and fails. This is the current behaviour; on the mgmt server
> > > > side the resource/task will be left hanging and may not be
> > > > automatically marked failed right away (maybe after the configured
> > > > timeout). My best guess is that applying the change should likely
> > > > not have any side effects, other than the exceptions/faults we
> > > > already observe.
> > > >
> > > > In my test, the failed async job did not get retried and I hit the
> > > > famous 'concurrency limit 1' issue. At this point, I had to manually
> > > > clean up the snapshot row and the rows from sync_queue,
> > > > sync_queue_item and async_job. On the agent side, the mgmt server
> > > > sends a cmd and the agent returns an answer after processing it; we
> > > > don't have the equivalent on the mgmt server side, where an agent
> > > > would send a cmd's answer and the mgmt server would process it
> > > > irrespective of the context. Therefore, unless the answer-receiving
> > > > mgmt server is in the right thread/context/state, those answers are
> > > > dropped.
> > > >
> > > > I think we need to solve for (1) claim and ownership management of a
> > > > resource (how to manage when the owner/mgmt server shuts down or
> > > > dies), (2) task handover: handing executing (in-flight) tasks to
> > > > another mgmt server when a mgmt server is shut down, (3) a central
> > > > locking service for this and other uses. The bigger change ties in
> > > > with the other things we've seen in the discussion around mgmt
> > > > server restart/shutdown. Until we get to solving the bigger issue,
> > > > perhaps we can provide some API/visual/UI ways to show the root
> > > > admin the async jobs in flight for a management server or alert
> > > > them, perhaps an API to do a cleaner mgmt server shutdown that waits
> > > > for all pending async jobs on a mgmt server to complete and does not
> > > > take any new async/job API requests (say, like Jenkins does with
> > > > jobs)?
> > > >
> > > > Marc - weren't you working on a zookeeper-based rolling
> > > > shutdown/restart? Did that handle some of the failure cases?
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Marc-Aurèle Brothier <ma...@exoscale.ch>
> > > > Sent: Monday, May 14, 2018 4:06:56 PM
> > > > To: dev@cloudstack.apache.org
> > > > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
> > > >
> > > > Hi,
> > > >
> > > > I'm also for a bigger change, but this PR already moves forward to a
> > > > better agent <-> management connection handling.
> > > >
> > > > @rhtyd did you test your PR manually by, for example, requesting a
> > > > long snapshot operation and disconnecting the agent?
> > > >
> > > > I have one concern here: when an async job is taken from the DB by a
> > > > management server (in a cluster configuration), the mgmt ID is put
> > > > in the row to tell which mgmt is managing the job. On disconnection
> > > > from an agent, the event is propagated, the job is marked as failed
> > > > in the database, and an error is returned in the API for that
> > > > command. Here we are only addressing letting the agent reconnect
> > > > quickly, but I'm unsure of what will happen in the mgmt when the job
> > > > response is received by a mgmt (which might be a different one from
> > > > the one registered in the job db row). I know this is where it
> > > > becomes complicated, because one async job might be only one part of
> > > > a bigger scenario for a command (like a live migration). I just want
> > > > to ensure it won't propagate further inconsistency.
> > > >
> > > > Marco
> > > >
> > > > On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:
> > > >
> > > > > Would prefer “A bigger design fix would be to make management
> > > > > server asynchronous of agent side answer/response handling”.
> > > > > However, I understand the volume of changes that requires.
> > > > >
> > > > > I looked at the PR, and I think that everything is ok there. Of
> > > > > course, I think we might need some more time to review and think
> > > > > about the possible outcomes of such changes.
> > > > >
> > > > > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.ya...@shapeblue.com> wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected
> > > > > > from the management server (say due to a mgmt server restart
> > > > > > etc.), the reconnection logic waits for any pending
> > > > > > tasks/commands to complete before reconnection attempts are
> > > > > > made. I tried to search the git history but could not find a
> > > > > > reason; can anyone share why we may need this?
> > > > > >
> > > > > > Based on the reported issue:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/issues/2633
> > > > > >
> > > > > > I have a working patch which removes this limitation:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/pull/2638
> > > > > >
> > > > > > From testing with various combinations of tasks, I found that
> > > > > > when that happens, even if the pending task succeeds, it fails
> > > > > > to send an Answer to the mgmt server; therefore, from the
> > > > > > control plane's perspective, that task is still pending/ongoing.
> > > > > >
> > > > > > When the mgmt server comes back online and the agent finally
> > > > > > reconnects (depending on how long the pending task took), the
> > > > > > executed operation is still pending in the mgmt server's view
> > > > > > and may sometimes require manual cleanups in the database. By
> > > > > > removing the limitation in the above PR, at least the agent
> > > > > > reconnects faster, while the failure/fault behaviours remain the
> > > > > > same. A bigger design fix would be to make management server
> > > > > > asynchronous of agent side answer/response handling.
> > > > > >
> > > > > > - Rohit
> > > > > >
> > > > > > rohit.ya...@shapeblue.com
> > > > > > www.shapeblue.com
> > > > > > 53 Chandos Place, Covent Garden, London WC2N 4HS UK
> > > > > > @shapeblue
> > > > >
> > > > > --
> > > > > Rafael Weingärtner
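
A minimal sketch, in Java, of the reconnect behaviour discussed in this thread. It is not the CloudStack Agent code: ReconnectSketch, ToyAgent, Link, canReconnect and finishTask are hypothetical names invented for illustration. It assumes only what the thread states: on disconnect the link is set to null, a pending task that finishes afterwards cannot send its answer, and the legacy logic delays reconnection until pending tasks complete.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; hypothetical stand-ins, not the real CloudStack classes. */
class ReconnectSketch {

    /** Hypothetical stand-in for the agent-to-management-server connection. */
    static final class Link {
        final List<String> delivered = new ArrayList<>();

        void send(String answer) {
            delivered.add(answer);
        }
    }

    /** Toy agent that tracks its current link and the number of pending tasks. */
    static final class ToyAgent {
        private Link link;
        private int pendingTasks;

        void connect(Link newLink) {
            link = newLink;
        }

        void startTask() {
            pendingTasks++;
        }

        /** As described in the thread, the link is set to null on disconnect. */
        void onDisconnect() {
            link = null;
        }

        /**
         * Legacy behaviour (waitForPendingTasks = true): reconnect only once
         * no task is pending. Changed behaviour (false): reconnect right away.
         */
        boolean canReconnect(boolean waitForPendingTasks) {
            return !waitForPendingTasks || pendingTasks == 0;
        }

        /** Completes a task; returns false when there is no link to send the answer on. */
        boolean finishTask(String answer) {
            pendingTasks--;
            if (link == null) {
                return false; // the thread reports this as a failure/exception on send
            }
            link.send(answer);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyAgent agent = new ToyAgent();
        agent.connect(new Link());
        agent.startTask();      // e.g. a long snapshot operation
        agent.onDisconnect();   // management server restarts or network fails

        System.out.println("legacy logic can reconnect now: " + agent.canReconnect(true));
        System.out.println("changed logic can reconnect now: " + agent.canReconnect(false));
        System.out.println("answer sent: " + agent.finishTask("SnapshotAnswer"));
        System.out.println("legacy logic can reconnect after task: " + agent.canReconnect(true));
    }
}
```

In this sketch the legacy logic cannot reconnect while the task is pending, the changed logic can, and the answer is not sent once the link is gone. What the management server does with an answer that arrives after a reconnect is outside the sketch.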