Igor Morozov created AURORA-1749:
------------------------------------

             Summary: Add support for distributed job update coordination
                 Key: AURORA-1749
                 URL: https://issues.apache.org/jira/browse/AURORA-1749
             Project: Aurora
          Issue Type: Story
          Components: Scheduler
            Reporter: Igor Morozov


This is for a use case where the same job is spread across multiple 
datacenters and managed by different Aurora clusters.
For example, we have a service job test-service that runs in two datacenters, 
dc1 and dc2.

Logically, the job needs to be updated in lockstep across multiple data 
centers, and if any job update fails and goes into the ROLLING_BACK state, 
all the others need to start a rollback as well.
 
This is what we want to achieve with this change:

1. Coordinator starts an upgrade:
    dc1: -> starting update1 for job1
    dc2: -> starting update2 for job2
2. Coordinator:
    dc1: update1 is done, enters paused state
    dc2: update2 has failed, rolling back
3. Coordinator:
    dc1: starts rolling back update1
    dc2: update2 is rolled back
4. Coordinator:
    dc1: update1 is rolled back
    dc2: update2 is rolled back
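
The coordinator's decision at each step above could be sketched roughly as 
follows. This is only an illustration of the desired behavior, not an existing 
Aurora API: the UpdateState enum, the PAUSED state, the decide() function and 
the "rollback"/"commit"/"wait" action names are all assumptions made for this 
example. Only ROLLING_BACK and ROLLED_FORWARD are state names taken from this 
ticket.

```python
from enum import Enum
from typing import Dict


class UpdateState(Enum):
    # Simplified, hypothetical model of per-datacenter update states.
    ROLLING_FORWARD = "ROLLING_FORWARD"
    PAUSED = "PAUSED"                  # assumed: update is done but held open
    ROLLING_BACK = "ROLLING_BACK"
    ROLLED_BACK = "ROLLED_BACK"
    ROLLED_FORWARD = "ROLLED_FORWARD"  # terminal, cannot be rolled back


ROLLBACK_STATES = (UpdateState.ROLLING_BACK, UpdateState.ROLLED_BACK)
REVERSIBLE_STATES = (UpdateState.ROLLING_FORWARD, UpdateState.PAUSED)


def decide(states: Dict[str, UpdateState]) -> Dict[str, str]:
    """Return one action per datacenter: "rollback", "commit" or "wait"."""
    any_rollback = any(s in ROLLBACK_STATES for s in states.values())
    all_paused = all(s is UpdateState.PAUSED for s in states.values())

    actions = {}
    for dc, state in states.items():
        if any_rollback and state in REVERSIBLE_STATES:
            # Another datacenter failed: this update must roll back too.
            actions[dc] = "rollback"
        elif all_paused:
            # Every datacenter finished successfully: finalize everywhere.
            actions[dc] = "commit"
        else:
            # Still in progress, already rolling back, or already terminal.
            actions[dc] = "wait"
    return actions


# Step 2 from the list above: dc1 is paused, dc2 has failed.
step2 = decide({"dc1": UpdateState.PAUSED, "dc2": UpdateState.ROLLING_BACK})
```

In this sketch, a datacenter that has already reached ROLLED_FORWARD only ever 
gets "wait", which is exactly the gap described here: once the update is 
terminal, the coordinator has nothing left to roll back.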

Currently step 2 is impossible in Aurora, as the job update enters a terminal 
state and cannot be rolled back from it.
There was some discussion in the AURORA-1721 ticket about using another job 
update to roll the job forward to a previous version, effectively simulating a 
rollback. But then, in order to reconcile the state of an actual update 
operation, one would need to consider two or more job updates and differentiate 
between normal ROLLED_FORWARD and ROLLED_FORWARD(rollback) updates. That feels 
quite artificial and error-prone. We believe the ability to run a coordinated 
job update across multiple data centers should be a first-class citizen in 
Aurora.


