[ https://issues.apache.org/jira/browse/AURORA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286219#comment-14286219 ]
Maxim Khutornenko commented on AURORA-1041:
-------------------------------------------
https://reviews.apache.org/r/29943/
> Allow job uptime stats to control scheduler updater pace
> ---------------------------------------------------------
>
> Key: AURORA-1041
> URL: https://issues.apache.org/jira/browse/AURORA-1041
> Project: Aurora
> Issue Type: Task
> Components: Client, Scheduler
> Reporter: Maxim Khutornenko
> Assignee: Maxim Khutornenko
>
> The current implementation of the scheduler updater relies on a user-defined
> {{batch_size}} value to determine how many instances can be updated
> simultaneously. While this approach is well understood and battle-tested, it
> comes with its own risks and inefficiencies:
> - No knowledge of job health outside of an active batch. Once an instance
> passes the {{watch_secs}} interval, it is considered "healthy" and is never
> looked at by the updater again. Even if updated instances start flapping
> later, the updater keeps going.
> - The fixed {{batch_size}} value may artificially slow down updater
> progress, as it's usually chosen conservatively as the max number of instances
> a service can tolerate losing at any given moment and may not reflect the
> actual job restart pace (see related AURORA-894).
> - Instances are evaluated/updated in an ordered fashion, so when an update
> both modifies existing instances and adds new ones, the new instances come
> up at the very end of the update sequence.
> The proposed solution will capitalize on the concept of *job uptime*
> introduced in AURORA-290 and will allow the scheduler updater to proceed as
> long as the "X% of instances up over Y interval" job invariant is met.
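For illustration only, here is a minimal Python sketch of how the "X% of instances up over Y interval" invariant could gate updater pace. It is not the scheduler's implementation. All names (UptimeInvariant, invariant_met, update_headroom) are hypothetical, and it assumes an instance counts as "up" once it has been continuously running for at least Y seconds, which is one possible reading of the job uptime concept from AURORA-290.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UptimeInvariant:
    """Hypothetical "X% of instances up over Y interval" rule."""
    min_percent_up: float  # X: required percentage of "up" instances
    interval_secs: float   # Y: how long an instance must have been running


def count_up(running_since: Dict[int, Optional[float]],
             now: float, interval_secs: float) -> int:
    """Count instances continuously running for at least interval_secs.

    running_since maps instance id -> start timestamp of the current run,
    or None if the instance is not running (assumed data shape).
    """
    return sum(
        1 for started in running_since.values()
        if started is not None and now - started >= interval_secs
    )


def invariant_met(running_since: Dict[int, Optional[float]],
                  now: float, invariant: UptimeInvariant) -> bool:
    """True if at least X% of the job's instances are currently "up"."""
    total = len(running_since)
    if total == 0:
        return True  # assumption: an empty job trivially satisfies the rule
    up = count_up(running_since, now, invariant.interval_secs)
    return 100.0 * up / total >= invariant.min_percent_up


def update_headroom(running_since: Dict[int, Optional[float]],
                    now: float, invariant: UptimeInvariant) -> int:
    """How many "up" instances could be restarted without breaking the rule.

    This plays the role that a fixed batch_size plays today, but it is
    derived from observed job health instead of a user-chosen constant.
    """
    total = len(running_since)
    required = math.ceil(invariant.min_percent_up * total / 100.0)
    up = count_up(running_since, now, invariant.interval_secs)
    return max(0, up - required)
```

Under these assumptions, an updater loop would restart at most update_headroom(...) instances at a time and pause whenever invariant_met(...) turns false, so instances that start flapping after their watch_secs window would still slow the update down.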