I very much agree with Paul: we should consider moving to a resilient model
that depends as little as possible on external components such as HA-proxy.

Sending a notification to a partner management server (MS) so that it takes
over the job management would be ideal.
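
As a rough illustration of that handover, here is a minimal sketch. It assumes
a purely hypothetical in-memory map from job id to the id of the management
server that monitors the job; the class and method names (JobHandover, assign,
takeOver) are invented for illustration and are not CloudStack APIs. How the
partner MS is actually notified is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of async job ownership; not CloudStack code. */
final class JobHandover {
    // jobId -> id of the management server currently monitoring that job
    private final Map<Long, Long> ownerByJob = new LinkedHashMap<>();

    /** Records that the given management server monitors the given job. */
    void assign(long jobId, long msId) {
        ownerByJob.put(jobId, msId);
    }

    /** Returns the id of the server monitoring the job, or null if unknown. */
    Long ownerOf(long jobId) {
        return ownerByJob.get(jobId);
    }

    /** Moves every job owned by fromMs to toMs and returns the moved job ids. */
    List<Long> takeOver(long fromMs, long toMs) {
        List<Long> moved = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : ownerByJob.entrySet()) {
            if (entry.getValue() == fromMs) {
                entry.setValue(toMs); // updating a value is safe while iterating
                moved.add(entry.getKey());
            }
        }
        return moved;
    }
}
```

On receiving the notification, the partner MS would call something like
takeOver(leavingMsId, ownMsId) and then resume monitoring the returned jobs.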

On Mon, Dec 18, 2017 at 9:28 AM Paul Angus <paul.an...@shapeblue.com> wrote:

> Hi Marc-Aurèle,
>
> Personally, my utopia would be to be able to pass async jobs between mgmt.
> servers.
> So rather than waiting an indeterminate time for a snapshot to complete,
> monitoring of the job would be passed to another management server.
>
> I would LOVE for something like Zookeeper to monitor the state of the
> mgmt. servers, so that 'other' management servers could take over the async
> jobs in the (unlikely) event that a management server becomes unavailable.
>
>
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> @shapeblue
>
>
>
>
> -----Original Message-----
> From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch]
> Sent: 18 December 2017 13:56
> To: dev@cloudstack.apache.org
> Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs
>
> Hi everyone,
>
> Another point, another thread. Currently, when a management server is shut
> down, apart from the "stop()" methods not being called (as far as I know),
> the server could be in the middle of processing an async job task. This
> leads to a failed job, since the response won't be delivered to the correct
> management server, even though the job might have succeeded on the agent.
> To overcome this limitation, which affects us because of our weekly
> production upgrades, we added a pre-shutdown mechanism which works
> alongside HA-proxy. The management server keeps an eye on a file,
> "lb-agent", in which some keywords can be written following the HA-proxy
> guide (
> https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check
> ).
> When it finds "maint", "stopped" or "drain", it stops these threads:
>  - AsyncJobManager._heartbeatScheduler: responsible for fetching and
> starting execution of AsyncJobs
>  - AlertManagerImpl._timer: responsible for sending capacity check commands
>  - StatsCollector._executor: responsible for scheduling stats commands
>
> The management server then stops most of its scheduled tasks. The correct
> thing to do before shutting down the server would be to send
> "rebalance/reconnect" commands to all agents connected to that management
> server, to ensure that commands won't go through this server at all.
>
> Here, HA-proxy is responsible for no longer sending API requests to the
> corresponding server, with the help of this local agent check.
>
> In case you want to cancel the maintenance shutdown, you could write
> "up/ready" in the file and the different schedulers will be restarted.
>
> This is really more of an operational change around CS, for people doing
> live upgrades on a regular basis, so I'm unsure whether the community would
> want such a change in the code base. It goes somewhat in the opposite
> direction to the change that removes the need for HA-proxy:
> https://github.com/apache/cloudstack/pull/2309
>
> If there is enough positive feedback for such a change, I will port the
> changes to match the upstream branch and submit them in a PR.
>
> Kind regards,
> Marc-Aurèle
>
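
For readers following the thread, the keyword check on the "lb-agent" file
described above could look roughly like the sketch below. It is not the
Exoscale patch: the class name LbAgentCheck and the Action enum are invented
for illustration, and it assumes keywords are separated by whitespace, commas
or "/", and that a pause keyword wins if both kinds are present. Stopping and
restarting the actual schedulers is left to the caller.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: maps the content of the "lb-agent" file to an action. */
final class LbAgentCheck {
    enum Action { PAUSE, RESUME, NONE }

    // Keywords named in the thread; the precedence rule below is an assumption.
    private static final List<String> PAUSE_WORDS = Arrays.asList("maint", "stopped", "drain");
    private static final List<String> RESUME_WORDS = Arrays.asList("up", "ready");

    /** Decides what to do from the raw file content; a pause keyword wins. */
    static Action parse(String content) {
        Action result = Action.NONE;
        for (String token : content.toLowerCase(Locale.ROOT).split("[\\s,/]+")) {
            if (PAUSE_WORDS.contains(token)) {
                return Action.PAUSE;
            }
            if (RESUME_WORDS.contains(token)) {
                result = Action.RESUME;
            }
        }
        return result;
    }

    /** Reads the file and parses it; a missing file means "nothing to do". */
    static Action read(Path file) throws IOException {
        if (!Files.exists(file)) {
            return Action.NONE;
        }
        return parse(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

A periodic task on the management server would call read(...) and stop the
three schedulers on PAUSE or restart them on RESUME.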
