I'm trying to figure out whether and how management server and agent restarts are handled gracefully for long-running jobs. My initial testing suggests they aren't. For example, if I start a storage volume migration and then restart the management server, I end up with two volumes (source and destination) stuck in the Migrating state, the VM unable to start, and the job reporting:
{ "accountid": "505add16-12d8-11e3-8495-5254004eff4f", "cmd": "org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd", "created": "2013-09-03T11:41:55-0600", "jobid": "698cc7cf-4ecc-40da-9bcf-261a7921ab95", "jobprocstatus": 0, "jobresult": { "errorcode": 530, "errortext": "job cancelled because of management server restart" }, "jobresultcode": 530, "jobresulttype": "object", "jobstatus": 2, "userid": "505bd5d6-12d8-11e3-8495-5254004eff4f" } If all jobs react this way, it doesn't seem like a small bug, but perhaps a design issue. If a job is cancelled, the state should be rolled back, I think. Perhaps every job should have a cleanup method that is called when the job is considered cancelled (assuming the cancellation occurs prior to shutdown, but then that doesn't handle crashes). The end result is that everyone using cloudstack should be terrified of restarting their mgmt server, I think, especially as their environment grows and has many things going on. Anything that goes through a state machine could get stuck.