Maxim Khutornenko created AURORA-1041:
-----------------------------------------

             Summary: Allow job uptime stats to control scheduler updater pace 
                 Key: AURORA-1041
                 URL: https://issues.apache.org/jira/browse/AURORA-1041
             Project: Aurora
          Issue Type: Task
          Components: Client, Scheduler
            Reporter: Maxim Khutornenko
            Assignee: Maxim Khutornenko


The current implementation of the scheduler updater relies on a user-defined 
{{batch_size}} value to determine how many instances can be updated 
simultaneously. While this approach is well understood and battle-tested, it 
comes with its own risks and inefficiencies:
- No knowledge of job health outside of an active batch. Once an instance 
graduates from the {{watch_secs}} interval, it is considered "healthy" and is 
never looked at by the updater again. Even if updated instances start flapping 
later, the updater keeps on going.
- The fixed {{batch_size}} value may artificially slow down updater progress, 
as it is usually chosen conservatively as the max number of instances a 
service can tolerate losing at any given moment and may not reflect the actual 
job restart pace (see related AURORA-894).
- Instances are evaluated/updated in an ordered fashion, so in an update that 
both updates existing instances and adds new ones, any new instances come up 
at the very end of the update sequence.
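
The ordering issue in the last point can be illustrated with a minimal sketch. 
It is not the updater's actual code: the helper name {{ordered_batches}} is 
hypothetical, and it assumes that instances are processed by ascending 
instance ID and that newly added instances receive the highest IDs.

```python
def ordered_batches(instance_ids, batch_size):
    """Split instance IDs into consecutive fixed-size batches, lowest ID first.

    Hypothetical helper for illustration only; assumes the updater walks
    instances in ascending ID order.
    """
    ordered = sorted(instance_ids)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Assumed scenario: instances 0-3 already exist, instances 4-5 are being added.
batches = ordered_batches(range(6), batch_size=2)
new_instances = {4, 5}
# Index of the first batch that contains any newly added instance.
first_batch_with_new = next(
    i for i, batch in enumerate(batches) if new_instances & set(batch)
)
```

Under these assumptions the new instances land only in the final batch, after 
every existing instance has already been updated.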

The proposed solution will capitalize on the concept of *job uptime* introduced 
in AURORA-290 and will allow the scheduler updater to proceed as long as the 
"X% of instances up over Y interval" job invariant is met.
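
One possible reading of the "X% of instances up over Y interval" invariant is 
sketched below. The function name, its parameters, and the per-instance uptime 
list are assumptions made for illustration; the ticket does not specify how 
uptime is collected or how the check would be wired into the updater.

```python
def uptime_invariant_met(uptimes_secs, min_percent, interval_secs):
    """Return True if at least min_percent of instances have been up
    continuously for interval_secs or longer.

    uptimes_secs: current uptime in seconds for each instance (assumed input).
    min_percent:  the "X%" threshold from the invariant.
    interval_secs: the "Y interval" from the invariant.
    """
    if not uptimes_secs:
        # Assumption: a job with no instances never satisfies the invariant.
        return False
    up_count = sum(1 for uptime in uptimes_secs if uptime >= interval_secs)
    # Integer comparison avoids floating-point rounding at the threshold.
    return up_count * 100 >= min_percent * len(uptimes_secs)


# Hypothetical job: three of four instances have been up for at least 60 seconds.
can_proceed = uptime_invariant_met([120, 300, 45, 600], min_percent=75, interval_secs=60)
must_pause = not uptime_invariant_met([120, 300, 45, 600], min_percent=90, interval_secs=60)
```

In this reading, the updater would move on to further instances only while the 
check holds and would wait otherwise, so the pace follows observed job health 
rather than a fixed {{batch_size}}.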



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
