I very much agree with Paul: we should consider moving to a resilient model with the least dependence on external components, i.e. HA-proxy.
Sending a notification to a partner management server (MS) so that it takes over job management would be ideal.

On Mon, Dec 18, 2017 at 9:28 AM Paul Angus <paul.an...@shapeblue.com> wrote:

> Hi Marc-Aurèle,
>
> Personally, my utopia would be to be able to pass async jobs between mgmt.
> servers.
> So rather than waiting an indeterminate time for a snapshot to complete,
> monitoring of the job is passed to another management server.
>
> I would LOVE it if something like Zookeeper monitored the state of the
> mgmt. servers, so that 'other' management servers could take over the async
> jobs in the (unlikely) event that a management server becomes unavailable.
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> @shapeblue
>
> -----Original Message-----
> From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch]
> Sent: 18 December 2017 13:56
> To: dev@cloudstack.apache.org
> Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs
>
> Hi everyone,
>
> Another point, another thread. Currently, when a management server is shut
> down, besides none of the "stop()" methods being called as far as I know,
> the server could be in the middle of processing an async job task. This
> leads to a failed job, since the response won't be delivered to the correct
> management server even though the job might have succeeded on the agent. To
> overcome this limitation, which we hit because of our weekly production
> upgrades, we added a pre-shutdown mechanism that works alongside HA-proxy.
> The management server keeps an eye on a file "lb-agent" in which some
> keywords can be written following the HA-proxy guide (
> https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check
> ).
> When it finds "maint", "stopped" or "drain", it stops these threads:
> - AsyncJobManager._heartbeatScheduler: responsible for fetching and
>   starting the execution of AsyncJobs
> - AlertManagerImpl._timer: responsible for sending capacity check commands
> - StatsCollector._executor: responsible for scheduling stats commands
>
> Then the management server stops most of its scheduled tasks. The correct
> thing to do before shutting down the server would be to send
> "rebalance/reconnect" commands to all agents connected to that management
> server, to ensure that commands won't go through this server at all.
>
> Here, HA-proxy is responsible for no longer sending API requests to the
> corresponding server, with the help of this local agent check.
>
> In case you want to cancel the maintenance shutdown, you can write
> "up/ready" in the file and the different schedulers will be restarted.
>
> This is really more of an operational change around CS, for people doing
> live upgrades on a regular basis, so I'm unsure whether the community would
> want such a change in the code base. It goes a bit in the opposite direction
> of the change that removes the need for HA-proxy:
> https://github.com/apache/cloudstack/pull/2309
>
> If there is enough positive feedback for such a change, I will port the
> changes to match the upstream branch and open a PR.
>
> Kind regards,
> Marc-Aurèle
>
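
For anyone trying to picture the "lb-agent" file check described in the quoted message, here is a minimal sketch of how the keyword handling might look. It is not the Exoscale patch and not CloudStack code: the class name, the enum, and the rule that a pause keyword wins over a resume keyword are all assumptions made for illustration. Only the keywords themselves ("maint", "stopped", "drain", "up", "ready") come from the message above and the HA-proxy agent-check guide, which also allows words to be separated by spaces, tabs or commas.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: every name here is illustrative, none exists in CloudStack.
final class LbAgentFileCheck {

    // What the management server would do after reading the file.
    enum Action { PAUSE_SCHEDULERS, RESUME_SCHEDULERS, NO_CHANGE }

    // Keywords taken from the quoted message.
    private static final Set<String> PAUSE_WORDS =
            new HashSet<>(Arrays.asList("maint", "stopped", "drain"));
    private static final Set<String> RESUME_WORDS =
            new HashSet<>(Arrays.asList("up", "ready"));

    private LbAgentFileCheck() { }

    // Maps the file content to an action.
    // Assumption: if both kinds of keyword are present, pausing wins.
    static Action parse(String content) {
        Action action = Action.NO_CHANGE;
        String normalized = content.trim().toLowerCase(Locale.ROOT);
        for (String token : normalized.split("[\\s,]+")) {
            if (PAUSE_WORDS.contains(token)) {
                return Action.PAUSE_SCHEDULERS;
            }
            if (RESUME_WORDS.contains(token)) {
                action = Action.RESUME_SCHEDULERS;
            }
        }
        return action;
    }

    // Reads the file; a missing or unreadable file changes nothing.
    static Action readFile(Path file) {
        try {
            if (!Files.exists(file)) {
                return Action.NO_CHANGE;
            }
            byte[] bytes = Files.readAllBytes(file);
            return parse(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            return Action.NO_CHANGE;
        }
    }
}
```

A periodic task (for example one scheduled on a ScheduledExecutorService) could call readFile on the "lb-agent" path and, on PAUSE_SCHEDULERS, stop the three schedulers listed in the quoted message, then start them again on RESUME_SCHEDULERS. How those schedulers are actually stopped and restarted is not described in the thread, so it is left out here.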