Correct about the thread context, so if the answer comes into a
management server that doesn't have the context and drops it, it should be
fine then. The PR is then already a good improvement: it lets the agent
reconnect even while it's processing a long request, so it can keep
on completing other jobs too.
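
A minimal sketch of that drop behaviour, assuming answers are matched to
waiting requests by a sequence number. `PendingRequests`, `register` and
`onAnswer` are made-up names for illustration, not CloudStack classes:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not CloudStack code: each management server only knows
// about the requests it sent itself, keyed by sequence number.
final class PendingRequests {
    private final Map<Long, CompletableFuture<String>> waiting = new ConcurrentHashMap<>();

    // Called when this management server sends a command to an agent.
    CompletableFuture<String> register(long seq) {
        CompletableFuture<String> future = new CompletableFuture<>();
        waiting.put(seq, future);
        return future;
    }

    // Called when an answer arrives from an agent.
    // Returns true if a waiting request was completed, false if the answer
    // was dropped because this server has no context for it.
    boolean onAnswer(long seq, String answer) {
        CompletableFuture<String> future = waiting.remove(seq);
        if (future == null) {
            return false; // unknown sequence: drop, nothing else is touched
        }
        future.complete(answer);
        return true;
    }
}
```

With this shape, a late answer landing on a server that never registered the
sequence number is a no-op, which is the "should be fine" case above.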

Regarding the restart/shutdown operation, yes, I still have to push the
changes that stop some processing tasks (mainly fetching new async jobs)
on a management server to ensure a cleaner shutdown. My solution,
as said, is based on the content of a file that is compatible with
HAProxy, and thus does not use the LB mechanism added recently in CS. It
could be changed to an API call to put a management server into
maintenance or take it out. The listManagementServers API call has been
merged and it was a requirement for that.
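
Roughly, the idea can be sketched as below. This is only an illustration: the
class name `MaintenanceGate`, the method `mayFetchNewJobs` and the file
contents (`maint` / `drain` meaning "stop taking new work") are assumptions
made for the example, not the actual format of the file or the code to be
pushed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical sketch: a status file decides whether this management server
// may pick up new async jobs. The same file could be exposed to a load
// balancer health check.
final class MaintenanceGate {
    private final Path statusFile;

    MaintenanceGate(Path statusFile) {
        this.statusFile = statusFile;
    }

    // Assumption: a missing file means normal operation; "maint" or "drain"
    // means no new jobs should be fetched; anything else means "up".
    boolean mayFetchNewJobs() {
        try {
            String state = new String(Files.readAllBytes(statusFile), StandardCharsets.UTF_8).trim();
            return !(state.equalsIgnoreCase("maint") || state.equalsIgnoreCase("drain"));
        } catch (NoSuchFileException e) {
            return true;
        } catch (IOException e) {
            return false; // unreadable status: be conservative, take no new work
        }
    }
}
```

The job-polling loop would call `mayFetchNewJobs()` before each fetch, so jobs
already in flight can finish while no new ones are taken.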

About Zookeeper, it's not used for the rolling shutdown/restart for now. We
are using it as an efficient and true lock mechanism between multiple
management servers. We are slowly moving the locks code towards ZK and
added one during the allocation phase to ensure no host would be over
allocated. I will take this discussion to another email thread since I
have a few questions regarding ZK and also wish to talk about the
connection between the agent & management servers.
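
To make the allocation lock concrete, here is a small sketch of the
check-then-reserve pattern under a cluster-wide lock. Everything in it is
hypothetical: `ClusterLock` stands for whatever ZK-backed lock is used,
`LocalLock` is only an in-JVM stand-in so the example runs without a ZK
ensemble, and `HostAllocator` keeps its counter in memory where real code
would read and update capacity in the database.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical abstraction over a distributed lock (e.g. one backed by ZK).
interface ClusterLock {
    boolean acquire(long time, TimeUnit unit) throws InterruptedException;
    void release();
}

// In-JVM stand-in used only to make the sketch runnable.
final class LocalLock implements ClusterLock {
    private final ReentrantLock lock = new ReentrantLock();

    public boolean acquire(long time, TimeUnit unit) throws InterruptedException {
        return lock.tryLock(time, unit);
    }

    public void release() {
        lock.unlock();
    }
}

final class HostAllocator {
    private final ClusterLock lock;
    private final int capacity;
    private int allocated; // assumption: real code would keep this in the DB

    HostAllocator(ClusterLock lock, int capacity) {
        this.lock = lock;
        this.capacity = capacity;
    }

    // The capacity check and the reservation happen under the same lock, so
    // two management servers cannot both pass the check for the last slot.
    boolean allocate(int amount) {
        try {
            if (!lock.acquire(5, TimeUnit.SECONDS)) {
                return false; // could not get the lock in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            if (allocated + amount > capacity) {
                return false; // would over-allocate the host
            }
            allocated += amount;
            return true;
        } finally {
            lock.release();
        }
    }
}
```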

On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.ya...@shapeblue.com>
wrote:

> Thanks Marc and Rafael for replying.
>
>
> In my experimentation, when the agent disconnects it will wait for the pending
> jobs/tasks to complete, and on completion it creates an Answer instance and
> tries to send it using a `link` which no longer exists and fails. This is the
> current behaviour; on the mgmt server side the resource/task will be left
> hanging and may not be automatically marked failed right away (maybe only
> after the configured timeout). My best guess is that the application of the
> change should likely not have any side-effects, other than the
> exceptions/faults we already observe.
>
>
> In my test, the failed async job did not get retried and I hit the famous
> 'concurrency limit 1' issue. At this point, I had to manually clean up the
> snapshot row and the rows from sync_queue, sync_queue_item and async_job. On
> the agent side, the mgmt server sends a cmd and the agent returns an answer
> after processing it. We don't have the equivalent on the mgmt server side,
> where an agent sends a cmd's answer and the mgmt server processes it
> irrespective of the context. Therefore, unless the mgmt server receiving the
> answer is in the right thread/context/state, those answers are dropped.
>
>
> I think we need to solve for (1) claim and ownership management of a
> resource (how to manage when the owner/mgmt server shuts down or dies), (2)
> task handover: handing executing (in-flight) tasks over to another mgmt
> server when a mgmt server is shut down, (3) a central locking service for
> this and other uses. The bigger change ties in with the other things we've
> seen in the discussion around mgmt server restart/shutdown. Until we get to
> solving the bigger issue, perhaps we can provide some API/visual/UI ways to
> show the root admin the async jobs in flight for a management server or alert
> him, perhaps an API to do a cleaner mgmt server shutdown that waits for all
> pending async jobs on a mgmt server to complete and does not take any new
> async/job API requests (say like Jenkins does with jobs)?
>
>
> Marc - weren't you working on a Zookeeper-based rolling shutdown/restart?
> Did that handle some of the failure cases?
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Marc-Aurèle Brothier <ma...@exoscale.ch>
> Sent: Monday, May 14, 2018 4:06:56 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> server) disconnection?
>
> Hi,
>
> I'm also for a bigger change but this PR already moves forward to a better
> agent <-> management connection handling.
>
> @rhtyd did you test your PR manually by, for example, requesting a long
> snapshot operation and disconnecting the agent?
>
> I have one concern here: when an async job is taken from the DB by a
> management server (in a cluster configuration), the mgmt ID is put in the
> row to tell which mgmt is managing the job. On disconnection from an agent,
> the event is propagated, the job is marked as failed in the database, and
> an error is returned in the API for that command. Here we are only letting
> the agent reconnect quickly, but I'm unsure of what will
> happen in the mgmt when the job response is received by a mgmt (which might
> be another one than the one registered in the job db row). I know this is
> where it becomes complicated because one async job might be only one part of a
> bigger scenario for a command (like a live migration). I just want to
> ensure it won't propagate further inconsistency.
>
> Marco
>
> On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > Would prefer “A bigger design fix would be to make management server
> > asynchronous of agent side answer/response handling”. However, I understand
> > the volume of changes that requires.
> >
> > I looked at the PR, and I think that everything is ok there. Of course, I
> > think we might need some more time to review and think about the possible
> > outcomes of such changes.
> >
> > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.ya...@shapeblue.com>
> > wrote:
> >
> > > All,
> > >
> > >
> > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from the
> > > management server (say due to mgmt server restart etc), the reconnection
> > > logic waits for any pending tasks/commands to complete before reconnection
> > > attempts are made. I tried to search git history but could not find a
> > > reason, can anyone share why we may need this?
> > >
> > >
> > > Based on the reported issue:
> > >
> > > https://github.com/apache/cloudstack/issues/2633
> > >
> > >
> > > I've a working patch which removes this limitation:
> > >
> > > https://github.com/apache/cloudstack/pull/2638
> > >
> > >
> > > From testing with various combinations of tasks, I found that when that
> > > happens even if the pending task succeeds it fails to send an Answer to the
> > > mgmt server, therefore from the control plane's perspective that task is
> > > still pending/on-going.
> > >
> > >
> > > When the mgmt server comes back online, and the agent finally reconnects
> > > (depending on how long the pending task took) the executed operation is
> > > still pending in the mgmt server's view and may sometimes require manual
> > > cleanups in the database. By removing the limitation in the above PR, at
> > > least the agent reconnects faster while the failure/fault behaviours
> > > remain the same. A bigger design fix would be to make management server
> > > asynchronous of agent side answer/response handling.
> > >
> > >
> > > - Rohit
> > >
> > > <https://cloudstack.apache.org>
> > >
> > >
> > >
> > > rohit.ya...@shapeblue.com
> > > www.shapeblue.com<http://www.shapeblue.com>
> > > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > > @shapeblue
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
> rohit.ya...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>