Hi all,

I'd like to implement some new functionality in Aurora allowing for rolling
job restarts. There are many reasons why we might need to restart a job,
e.g. freeing instances of a job from deadlock or refreshing some sort of
external configuration.

Currently, there are two ways to execute a rolling restart, but both are
undesirable: either use the restartShards endpoint and implement batching
client-side, or use startJobUpdate with a slightly modified task config so
that a non-empty job diff forces an update. I propose adding a
new Thrift RPC for launching a rolling restart, which is an interface
around the existing update logic. Instead of requiring a TaskConfig and
instanceCount, this restart endpoint will accept only JobUpdateSettings and
will launch an update with the job's current task configuration.
All of the existing job update RPCs will still be able to access updates
that were launched from this restart endpoint. This ensures that restarts
are visible in the UI and that no additional storage changes are required.
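
To make the shape of the proposal concrete, here is a rough Python sketch of
the idea. It is not Aurora's actual API: FakeScheduler, start_rolling_restart,
restart_batches, and the update_group_size field are all hypothetical names
chosen for illustration, and the scheduler is an in-memory stand-in. The sketch
only shows that the restart call takes settings alone and reuses the stored
task configuration, going through the same path as a regular update.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobUpdateSettings:
    # Hypothetical subset of the settings: instances restarted per batch.
    update_group_size: int = 1


@dataclass(frozen=True)
class JobUpdateRequest:
    task_config: Dict[str, str]
    instance_count: int
    settings: JobUpdateSettings


class FakeScheduler:
    """In-memory stand-in for the scheduler (an assumption, not Aurora code)."""

    def __init__(self, task_config: Dict[str, str], instance_count: int):
        self.task_config = dict(task_config)
        self.instance_count = instance_count
        self.updates: List[JobUpdateRequest] = []

    def start_job_update(self, request: JobUpdateRequest) -> int:
        # Existing-style path: record the update and return its index as an id.
        self.updates.append(request)
        return len(self.updates) - 1

    def start_rolling_restart(self, settings: JobUpdateSettings) -> int:
        # Proposed path: the caller supplies settings only; the currently
        # stored task config and instance count are reused unchanged.
        request = JobUpdateRequest(
            dict(self.task_config), self.instance_count, settings
        )
        return self.start_job_update(request)


def restart_batches(instance_count: int, group_size: int) -> List[List[int]]:
    """Split instance ids 0..instance_count-1 into consecutive batches."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    ids = list(range(instance_count))
    return [ids[i:i + group_size] for i in range(0, instance_count, group_size)]
```

Because the restart is stored as an ordinary update in this sketch, anything
that reads the list of updates sees it without a separate code path, which is
the property the proposal relies on.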

If this proposal seems reasonable, I’ll file a ticket and draft up a more
detailed RFC for further review.

Cody
