Use case:
In any environment, administrators need to perform maintenance from time
to time. The current stop sequence of the CloudStack management server
ignores the fact that there may be long-running async jobs and simply
terminates the process. This in turn can create a poor user experience and
occasional inconsistency in the CloudStack database.
This is especially painful in large environments, where a user may have
thousands of nodes and patching runs continuously around the clock,
requiring workloads to be migrated from one node to another.
With that said, I've created a script that monitors the async job queue
for a given management server (MS) and waits for it to complete all jobs.
More details are posted below.
I'd like to introduce a "graceful-shutdown" option into the
systemctl/service handling of the cloudstack-management service.
The details of how it will work are below:
Workflow for graceful shutdown:
Using iptables/firewalld, block any connection attempts on ports 8080/8443
(the ports can be identified dynamically).
Identify the MSID for the node. Using that MSID, query the async_job table
for jobs that match all of the following:
1) still running (job_status = 0)
2) job_dispatcher not like "pseudoJobDispatcher"
3) job_init_msid = $my_ms_id
Monitor the async_job table for up to 60 minutes, until all async jobs for
the MSID are done, then proceed with the shutdown.
If the script fails for any reason or is terminated, catch the exit via the
trap command and unblock ports 8080/8443.
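
To make the workflow above concrete, here is a rough sketch of the control
flow in Python. It is not the script mentioned earlier, only an illustration.
All function and parameter names (PENDING_JOBS_SQL, iptables_rule,
wait_for_jobs, graceful_shutdown, count_pending, run, stop_service) are
hypothetical. The table and column names come from the conditions listed
above; the %s placeholder style assumes a typical MySQL driver. Two behaviours
are assumptions that the workflow does not spell out: on timeout the sketch
skips the shutdown, and it removes the firewall rules on every exit path,
including success.

```python
import time
from typing import Callable, List, Sequence

# Counts jobs matching the three conditions in the workflow.
# The %s placeholder is an assumption about the MySQL driver in use.
PENDING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def iptables_rule(action: str, port: int) -> List[str]:
    """Build an iptables command that inserts (-I) or deletes (-D) a REJECT rule."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be -I or -D")
    return ["iptables", action, "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "REJECT"]


def wait_for_jobs(count_pending: Callable[[], int],
                  timeout_s: float = 3600,
                  poll_s: float = 30,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until no jobs are pending (True) or the timeout expires (False)."""
    deadline = clock() + timeout_s
    while True:
        if count_pending() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def graceful_shutdown(count_pending: Callable[[], int],
                      run: Callable[[List[str]], object],
                      stop_service: Callable[[], object],
                      ports: Sequence[int] = (8080, 8443),
                      **wait_kwargs) -> bool:
    """Block the ports, drain the job queue, then stop the service."""
    blocked: List[int] = []
    try:
        for port in ports:
            run(iptables_rule("-I", port))
            blocked.append(port)
        drained = wait_for_jobs(count_pending, **wait_kwargs)
        if drained:
            stop_service()
        return drained
    finally:
        # Plays the role of the shell trap: undo only the rules we added.
        for port in blocked:
            run(iptables_rule("-D", port))
```

In a real deployment, run could be something like
lambda cmd: subprocess.run(cmd, check=True), and count_pending a function
that executes PENDING_JOBS_SQL with the node's MSID. Note that Python's
finally block does not run on SIGTERM by default, so a signal handler that
raises SystemExit would be needed to get the same effect as a shell trap.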
Comments are welcome
Regards,
ilya