I'm trying to figure out if/how management and agent restarts are
gracefully handled for long-running jobs. My initial testing suggests
that maybe they aren't. For example, if I migrate a storage volume and
then restart the management server, I end up with two volumes (source
and destination) stuck in the Migrating state, with the VM unable to
start and the job reporting:

            {
                "accountid": "505add16-12d8-11e3-8495-5254004eff4f",
                "cmd":
"org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd",
                "created": "2013-09-03T11:41:55-0600",
                "jobid": "698cc7cf-4ecc-40da-9bcf-261a7921ab95",
                "jobprocstatus": 0,
                "jobresult": {
                    "errorcode": 530,
                    "errortext": "job cancelled because of management
server restart"
                },
                "jobresultcode": 530,
                "jobresulttype": "object",
                "jobstatus": 2,
                "userid": "505bd5d6-12d8-11e3-8495-5254004eff4f"
            }

If all jobs react this way, this seems less like a small bug and more
like a design issue. If a job is cancelled, its state should be rolled
back, I think. Perhaps every job should have a cleanup method that is
called when the job is considered cancelled (assuming the cancellation
happens before shutdown, though that still doesn't cover crashes). A
rough sketch of what I mean is below.
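
To make the idea concrete, here's a sketch of the kind of hook I have
in mind. None of these types exist in CloudStack today; the names
(AsyncJobCleanupHandler, VolumeStore, MigrateVolumeCleanup) are made
up, and a real implementation would have to go through the proper
state machine and DAOs:

    // Hypothetical sketch only -- these interfaces do not exist in
    // CloudStack; this just illustrates the shape of the idea.
    interface VolumeStore {
        long sourceVolumeOf(long jobId);
        long destinationVolumeOf(long jobId);
        void setState(long volumeId, String state);
        void markDestroyed(long volumeId);
    }

    interface AsyncJobCleanupHandler {
        // Called when a job is cancelled, either before a clean shutdown
        // or on the next startup when a stale in-progress job is found.
        void onCancelled(long jobId);
    }

    class MigrateVolumeCleanup implements AsyncJobCleanupHandler {
        private final VolumeStore volumes;

        MigrateVolumeCleanup(VolumeStore volumes) {
            this.volumes = volumes;
        }

        @Override
        public void onCancelled(long jobId) {
            // Roll back instead of leaving both volumes stuck in Migrating:
            // the source goes back to Ready, and the half-copied
            // destination is flagged for removal.
            volumes.setState(volumes.sourceVolumeOf(jobId), "Ready");
            volumes.markDestroyed(volumes.destinationVolumeOf(jobId));
        }
    }

The harder part is the crash case, where no cleanup code gets to run
before the server dies; that would need something like a startup pass
that finds jobs still marked in-progress and invokes their cleanup.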

The end result is that everyone using CloudStack should be terrified
of restarting their management server, I think, especially as their
environment grows and has more going on at any given time. Anything
that goes through a state machine could get stuck.
