Maxim Khutornenko created AURORA-1023:
-----------------------------------------

             Summary: Releasing the update lock trips off scheduler updater
                 Key: AURORA-1023
                 URL: https://issues.apache.org/jira/browse/AURORA-1023
             Project: Aurora
          Issue Type: Bug
          Components: Scheduler
            Reporter: Maxim Khutornenko
            Priority: Critical


Here is the faulty sequence:
- User starts a scheduler job update and pauses it while it is still in progress
- User runs the "aurora job cancel-update" command, thus releasing the update lock
- User starts a new scheduler job update

At this point, any attempt to abort or pause an active update results in the 
following error [1]:
{noformat}
vagrant@vagrant-ubuntu-trusty-64:~$ aurora beta-update abort devcluster/www-data/prod/hello
 INFO] Aborting update for: devcluster/www-data/prod/hello
Failed to abort update due to error:
        expected one element but was: 
<JobUpdateSummary(updateId:4b7fdc14-428f-44e4-9261-908b606f47e2, 
jobKey:JobKey(role:www-data, environment:prod, name:hello), user:UNSECURE, 
state:JobUpdateState(status:ROLLING_FORWARD, createdTimestampMs:1421450382234, 
lastModifiedTimestampMs:1421450382234)), 
JobUpdateSummary(updateId:3c9c2fa2-8e51-4c13-8440-94364205a37b, 
jobKey:JobKey(role:www-data, environment:prod, name:hello), user:UNSECURE, 
state:JobUpdateState(status:ROLL_FORWARD_PAUSED, 
createdTimestampMs:1421450304935, lastModifiedTimestampMs:1421450324055))>
{noformat}
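To make the failure concrete, here is a minimal, self-contained Java sketch of the sequence above. It is a hypothetical in-memory model, not Aurora's scheduler code: the `Store` class, its methods, and the `getOnlyElement` helper are illustrative assumptions. The helper imitates Guava's `Iterables.getOnlyElement`, whose message the error text above appears to match. In this model, releasing the lock leaves the paused update non-terminal, so starting a second update produces two active summaries for the same job and the single-element lookup fails.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory model of AURORA-1023; not Aurora's scheduler code.
public class Aurora1023Repro {
  enum Status { ROLLING_FORWARD, ROLL_FORWARD_PAUSED, ABORTED }

  record Summary(String updateId, String jobKey, Status status) {
    boolean isActive() { return status != Status.ABORTED; }
  }

  // Imitates Guava's Iterables.getOnlyElement: fails unless exactly one element.
  static <T> T getOnlyElement(List<T> items) {
    if (items.isEmpty()) {
      throw new NoSuchElementException();
    }
    if (items.size() > 1) {
      String joined = items.stream().map(String::valueOf).collect(Collectors.joining(", "));
      throw new IllegalArgumentException("expected one element but was: <" + joined + ">");
    }
    return items.get(0);
  }

  static final class Store {
    final List<Summary> summaries = new ArrayList<>();
    final Set<String> lockedJobs = new HashSet<>();

    // Starting an update acquires the job's update lock.
    void startUpdate(String updateId, String jobKey) {
      if (!lockedJobs.add(jobKey)) {
        throw new IllegalStateException("Job is locked: " + jobKey);
      }
      summaries.add(new Summary(updateId, jobKey, Status.ROLLING_FORWARD));
    }

    // In this model, abort and pause both begin by resolving the job's single active update.
    Summary activeUpdate(String jobKey) {
      return getOnlyElement(summaries.stream()
          .filter(s -> s.jobKey().equals(jobKey) && s.isActive())
          .collect(Collectors.toList()));
    }

    void pause(String jobKey) {
      Summary active = activeUpdate(jobKey);
      summaries.set(summaries.indexOf(active),
          new Summary(active.updateId(), jobKey, Status.ROLL_FORWARD_PAUSED));
    }

    // Models "cancel-update": drops the lock but leaves the update non-terminal.
    void releaseLock(String jobKey) {
      lockedJobs.remove(jobKey);
    }
  }

  public static void main(String[] args) {
    Store store = new Store();
    String job = "devcluster/www-data/prod/hello";
    store.startUpdate("update-1", job);
    store.pause(job);
    store.releaseLock(job);
    store.startUpdate("update-2", job);
    try {
      store.pause(job); // the lookup now sees two active updates
    } catch (IllegalArgumentException e) {
      System.out.println("Failed: " + e.getMessage());
    }
  }
}
```

Running `main` prints a "Failed: expected one element but was: <...>" line listing both summaries, the paused one and the rolling-forward one.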

The only way to recover from this state is either to wait for the active job 
update to reach a terminal state or to force it into one by running another 
cancel-update.

While "cancel-update" will eventually go away along with the client updater, we 
do have a problem during the migration period. A possible (though ugly) 
short-term workaround would be to call "abortJobUpdate" from the "releaseLock" 
RPC.
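The proposed workaround can be sketched with a hypothetical in-memory model. The `Store` class and its `abortJobUpdate` and `releaseLock` methods below are illustrative stand-ins named after the RPCs in the issue; they are assumptions, not Aurora's actual Thrift API or scheduler implementation. The idea is that releasing the lock first moves any non-terminal update for the job to ABORTED, so a later update is the only active one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of the workaround: releaseLock() also aborts the job's
// non-terminal updates. Names are illustrative, not Aurora's API.
public class ReleaseLockWorkaround {
  enum Status { ROLLING_FORWARD, ROLL_FORWARD_PAUSED, ABORTED }

  record Summary(String updateId, String jobKey, Status status) {
    boolean isActive() { return status != Status.ABORTED; }
  }

  static final class Store {
    final List<Summary> summaries = new ArrayList<>();
    final Set<String> lockedJobs = new HashSet<>();

    void startUpdate(String updateId, String jobKey) {
      if (!lockedJobs.add(jobKey)) {
        throw new IllegalStateException("Job is locked: " + jobKey);
      }
      summaries.add(new Summary(updateId, jobKey, Status.ROLLING_FORWARD));
    }

    List<Summary> activeUpdates(String jobKey) {
      return summaries.stream()
          .filter(s -> s.jobKey().equals(jobKey) && s.isActive())
          .collect(Collectors.toList());
    }

    // Replaces every active update of the job with a copy in the given status.
    private void transitionActive(String jobKey, Status next) {
      for (int i = 0; i < summaries.size(); i++) {
        Summary s = summaries.get(i);
        if (s.jobKey().equals(jobKey) && s.isActive()) {
          summaries.set(i, new Summary(s.updateId(), jobKey, next));
        }
      }
    }

    void pause(String jobKey) {
      transitionActive(jobKey, Status.ROLL_FORWARD_PAUSED);
    }

    // Stand-in for the abortJobUpdate RPC.
    void abortJobUpdate(String jobKey) {
      transitionActive(jobKey, Status.ABORTED);
    }

    // Workaround: abort any in-flight update before dropping the lock.
    void releaseLock(String jobKey) {
      abortJobUpdate(jobKey);
      lockedJobs.remove(jobKey);
    }
  }

  public static void main(String[] args) {
    Store store = new Store();
    String job = "devcluster/www-data/prod/hello";
    store.startUpdate("update-1", job);
    store.pause(job);
    store.releaseLock(job); // what "aurora job cancel-update" would trigger
    store.startUpdate("update-2", job);
    System.out.println(store.activeUpdates(job));
  }
}
```

Aborting before removing the lock keeps the invariant that a job has at most one non-terminal update, which is exactly what a single-element lookup of the active update assumes.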

[1] - 
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java#L295-L296



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
