[ 
https://issues.apache.org/jira/browse/AURORA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453730#comment-15453730
 ] 

Justin Pinkul commented on AURORA-1602:
---------------------------------------

We recently hit a failure scenario in Mesos where this tool would have helped. 
Our cluster experienced a full power outage, and the Aurora schedulers 
recovered faster than the Mesos agents. When Aurora sent its explicit task 
reconciliation request to the Mesos master, the master had to drop the request 
because nodes were in a transitory state. That state lasted only a couple of 
minutes longer, but Aurora did not attempt reconciliation again for another 
reconciliation_explicit_interval (in our case the default of 60min). As a 
result, it took the Aurora scheduler an extra hour to reschedule existing jobs 
that were lost in the power outage. Had there been a tool to trigger 
reconciliation on demand, the cluster could have been recovered faster.
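
To make the timing concrete, here is a minimal sketch of the arithmetic, not 
Aurora code. The function name, its parameters, and the 2-minute and 3-minute 
figures are illustrative assumptions; only the 60-minute interval and the 
"couple of minutes" of unavailability come from the incident itself.

```python
from datetime import timedelta


def recovery_delay(interval, unavailable_for, manual_trigger_after=None):
    """Time from the dropped reconciliation request until one succeeds.

    interval: fixed gap between scheduled explicit reconciliations.
    unavailable_for: how long the master keeps dropping requests.
    manual_trigger_after: when an operator would trigger a run by hand
        (a hypothetical admin command), or None if no such tool exists.
    """
    # Scheduled attempts happen at interval, 2 * interval, ... after the
    # dropped request; the first one at or after recovery succeeds.
    attempt = interval
    while attempt < unavailable_for:
        attempt += interval

    # A manual trigger only helps if it is issued once the master is ready.
    if manual_trigger_after is not None and manual_trigger_after >= unavailable_for:
        return min(attempt, manual_trigger_after)
    return attempt


hour = timedelta(minutes=60)
outage = timedelta(minutes=2)  # assumed value for "a couple of minutes"

print(recovery_delay(hour, outage))                        # fixed schedule only
print(recovery_delay(hour, outage, timedelta(minutes=3)))  # with a manual trigger
```

With a fixed schedule the wait is a full interval no matter how briefly the 
master was unavailable; an on-demand trigger would bound it by how quickly an 
operator reacts.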

> Add aurora_admin command to trigger reconciliation 
> ---------------------------------------------------
>
>                 Key: AURORA-1602
>                 URL: https://issues.apache.org/jira/browse/AURORA-1602
>             Project: Aurora
>          Issue Type: Task
>          Components: Client
>            Reporter: Zameer Manji
>
> Currently reconciliation runs on a fixed schedule. Adding an admin RPC to 
> trigger it is useful for operators who want to speed up cluster recovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)