[
https://issues.apache.org/jira/browse/AURORA-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400116#comment-15400116
]
David McLaughlin commented on AURORA-1721:
------------------------------------------
I am also +1 to (1). I like the idea of explicitly being able to force the
update into ROLLING_BACK while it is in progress.
I'm -1 to (2). A valid state transition is a user explicitly pausing the
update. Should we block that until a pulse? It just adds needless complexity to
the product and UX.
If you have something external that knows how to send a pulse to verify
everything is ok, can't that thing also keep track of previous state in order
to roll back? In fact I'd say longer term you will eventually run into the
situation where someone comes to you and says they need to roll back after
several hours (or even days) of being in production, long after you decided to
mark the updates as ROLLED_FORWARD. This is what influenced our solution at
Twitter (to save JobUpdateRequests and allow users to replay them).
You may be wondering how to get a previous state in order to call the Aurora
API. A JobUpdateRequest representing the previous state can easily be
constructed from the JobUpdateDetails. The JobUpdateRequest has only three
fields: a TaskConfig, an instance count and the JobUpdateSettings.
{code}
struct JobUpdateRequest {
  /** Desired TaskConfig to apply. */
  1: TaskConfig taskConfig
  /** Desired number of instances of the task config. */
  2: i32 instanceCount
  /** Update settings and limits. */
  3: JobUpdateSettings settings
}
{code}
To get these values from a JobUpdateDetails:
{code}
struct JobUpdateDetails {
  /** Update definition. */
  1: JobUpdate update
  ...
}

/** Full definition of the job update. */
struct JobUpdate {
  ...
  /** Update configuration. */
  2: JobUpdateInstructions instructions
}

struct JobUpdateInstructions {
  /** Actual InstanceId -> TaskConfig mapping when the update was requested. */
  1: set<InstanceTaskConfig> initialState
  /** Update specific settings. */
  3: JobUpdateSettings settings
}

struct InstanceTaskConfig {
  /** A TaskConfig associated with instances. */
  1: TaskConfig task
  /** Instances associated with the TaskConfig. */
  2: set<Range> instances
}
{code}
So using these structs, a rollback algorithm based on JobUpdateKey could be:
{code}
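// Look up the update whose effects should be undone.
JobUpdateDetails oldUpdate = client.getJobUpdateDetails(new JobUpdateKey(...));

// initialState groups instances by the TaskConfig they ran before the update.
for (InstanceTaskConfig iConfig : oldUpdate.update.instructions.initialState) {
  // calculateInstanceCount is a helper, not shown here.
  JobUpdateRequest rollbackRequest = new JobUpdateRequest(
      iConfig.task,
      calculateInstanceCount(iConfig.instances),
      oldUpdate.update.instructions.settings
  );
  client.startJobUpdate(rollbackRequest);
}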
{code}
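The calculateInstanceCount helper is not spelled out above. A minimal sketch of
what it could look like follows, assuming Range carries inclusive first and
last instance IDs and that the ranges in a set never overlap. The Range class
here is a local stand-in for the Thrift-generated type, and RollbackSketch is a
hypothetical name used only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class RollbackSketch {
    /** Stand-in for the Thrift Range struct; first and last are assumed inclusive. */
    static final class Range {
        final int first;
        final int last;

        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }
    }

    /** Sums the sizes of all ranges, assuming they do not overlap. */
    static int calculateInstanceCount(Set<Range> instances) {
        int count = 0;
        for (Range r : instances) {
            count += r.last - r.first + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Range> instances = new LinkedHashSet<>();
        instances.add(new Range(0, 2)); // instances 0, 1, 2
        instances.add(new Range(5, 6)); // instances 5, 6
        System.out.println(calculateInstanceCount(instances)); // 3 + 2 instances
    }
}
```

If the ranges could overlap, the helper would need to merge them first or count
distinct instance IDs instead.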
In practice though, it might be easier for you to just persist JobUpdateRequest
instances from the client (we added a custom noun in our client to generate the
JobUpdateRequest from the Aurora DSL, can discuss if you want) and store them
for rollbacks.
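To make the idea concrete, here is a rough sketch of a client-side history of
submitted requests. It is not how our client does it: UpdateRequestHistory and
its methods are hypothetical names, the type parameter R stands in for
JobUpdateRequest, and the in-memory map is a placeholder for whatever durable
store you would actually use.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;

/** Hypothetical history of submitted update requests, keyed by job. */
class UpdateRequestHistory<R> {
    // Most recent request sits at the head of each deque.
    private final Map<String, Deque<R>> byJob = new HashMap<>();

    /** Remember a request once it has been submitted for the given job. */
    void record(String jobKey, R request) {
        byJob.computeIfAbsent(jobKey, k -> new ArrayDeque<>()).push(request);
    }

    /** The request submitted before the most recent one, i.e. the rollback target. */
    Optional<R> previous(String jobKey) {
        Deque<R> history = byJob.get(jobKey);
        if (history == null || history.size() < 2) {
            return Optional.empty();
        }
        Iterator<R> it = history.iterator();
        it.next(); // skip the most recent request
        return Optional.of(it.next());
    }
}
```

Rolling back is then a matter of replaying whatever previous() returns through
startJobUpdate, regardless of how long ago the update finished.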
Thoughts?
> Support user initiated rollback
> --------------------------------
>
> Key: AURORA-1721
> URL: https://issues.apache.org/jira/browse/AURORA-1721
> Project: Aurora
> Issue Type: Task
> Components: Scheduler
> Reporter: Igor Morozov
> Assignee: Igor Morozov
> Labels: Uber
> Fix For: 0.16.0
>
>
> The proposal to support user initiated rollback:
> 1. Create new thrift API:
> /**Rollback job update. */
> Response rollbackJobUpdate(
> /** The update to rollback. */
> 1: JobUpdateKey key,
> /** A user-specified message to include with the induced job update
> state change. */
> 3: string message)
> 2. Implement new API in a scheduler so the implementation would just undo
> the latest JobUpdate, effectively trying to apply initialState to the job. If
> that is impossible for some reason, the rollback will fail with an
> appropriate error message.
> 3. Support new aurora client command 'rollback'