Hi Aurora people,

I would like to start a discussion around a few things we would like to see supported in the Aurora scheduler. This is based on our experience integrating Aurora into Uber's infrastructure, and I believe all the items I'm going to talk about will benefit the community and people running Aurora clusters.

1. We support multiple Aurora clusters in different failure domains and we run services in those domains. The upgrade workflow for those services includes rolling out the same version of a service to all Aurora clusters concurrently while monitoring health status and other service vitals such as error logs, service stats, and downstream/upstream service health. That means we occasionally need to manually trigger a rollback if things go south and roll back the update jobs in all Aurora clusters for that particular service. Here are the problems we have discovered so far with this approach:

       - We don't have an easy way to assign a common unique identifier to all JobUpdates in the different Aurora clusters so that we can later reconcile them into a single "meta" update job, so to speak. Instead we have to generate that ID ourselves and keep it in every cluster's JobUpdate metadata (JobUpdateRequest.taskConfig). Then, in order to get the status of the upgrade workflow running in the different data centers, we have to query all recent job updates and, based on their metadata content, filter the ones we think belong to the currently running upgrade for the service.
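
For illustration, the current workaround looks roughly like this. This is a minimal Python sketch against the thrift-generated bindings; make_client and build_update_request stand in for our own per-cluster wiring and are not part of the Aurora API:

import uuid
from gen.apache.aurora.api.ttypes import Metadata

META_KEY = 'meta_update_id'  # our own convention, not an Aurora concept
shared_id = str(uuid.uuid4())
clients = {name: make_client(name) for name in ('cluster-a', 'cluster-b')}

# Start the same update in every cluster, planting the shared id in the only
# place available today: the task config metadata.
for name, client in clients.items():
    request = build_update_request(name)
    request.taskConfig.metadata.add(Metadata(key=META_KEY, value=shared_id))
    client.startJobUpdate(request, 'rolling out new build')

# To reconcile status later we have to list recent updates in every cluster,
# fetch their details and filter on the metadata planted above.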

We propose to change JobUpdateRequest to:

struct JobUpdateRequest {
  /** Desired TaskConfig to apply. */
  1: TaskConfig taskConfig

  /** Desired number of instances of the task config. */
  2: i32 instanceCount

  /** Update settings and limits. */
  3: JobUpdateSettings settings

  /** Optional JobUpdate key id; if not specified, Aurora will generate one. */
  4: optional string id
}
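
With the proposed field, the coordinating workflow could simply hand the same id to every cluster up front. Continuing the hedged sketch above (clients and build_update_request are still our own plumbing):

import uuid

shared_id = str(uuid.uuid4())
for name, client in clients.items():
    request = build_update_request(name)
    request.id = shared_id  # the proposed optional field
    client.startJobUpdate(request, 'rolling out new build')

# Reconciliation then becomes a direct per-cluster lookup by
# JobUpdateKey(job=<job key>, id=shared_id) instead of filtering on metadata.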

There is potentially another, much more involved, solution: supporting user-defined update metadata, as mentioned in this ticket:
https://issues.apache.org/jira/browse/AURORA-1711


    -  All that brings us to the second problem we had to deal with during the upgrade: we don't have a good way to manually trigger a job update rollback in Aurora. The use case is again the same: while running multiple update jobs in different Aurora clusters, we have a real production requirement to start rolling back the update jobs if things are misbehaving, and the nature of that misbehavior can be very complex. Currently we abort the job update and start a new one that essentially rolls the cluster forward to the previously deployed version of the software.
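
Roughly, that workaround looks like this (a sketch only; update_keys, previous_task_config and the request builder are our own bookkeeping, not Aurora API):

# Abort the in-flight update in every cluster, then start a fresh update that
# "rolls forward" to the previous task config.
for name, client in clients.items():
    client.abortJobUpdate(update_keys[name], 'health checks failing, reverting')
    request = build_update_request(name, task_config=previous_task_config)
    client.startJobUpdate(request, 'revert to previous version')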

We propose a new convenience API to roll back a running or completed JobUpdate:

  /** Rolls back a job update. */
  Response rollbackJobUpdate(
      /** The update to roll back. */
      1: JobUpdateKey key,
      /** A user-specified message to include with the induced job update state change. */
      3: string message)
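
A monitoring workflow could then revert everything in one sweep, for example (a sketch assuming the shared update id from the first proposal; rollbackJobUpdate is the call proposed above):

from gen.apache.aurora.api.ttypes import JobUpdateKey

def rollback_everywhere(clients, job_key, shared_id, reason):
    # One rollbackJobUpdate call per cluster; each scheduler would reverse its
    # own update back to the previously active config, much like the automatic
    # rollback on a failed update does today.
    key = JobUpdateKey(job=job_key, id=shared_id)
    for name, client in clients.items():
        client.rollbackJobUpdate(key, reason)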

2. The next problem is related to the way we collect service cluster status. I couldn't find a way to quickly get the latest status of every instance/shard of a job in one query. Instead we query all task statuses for the job, then manually iterate through them and keep only the latest one per instance id (see the sketch after the struct below). For services with a lot of churn in task statuses, that means huge blobs of thrift transferred every time we issue the query. I was thinking of adding something along these lines:
struct TaskQuery {
  // TODO(maxim): Remove in 0.7.0. (AURORA-749)
  8: Identity owner
  14: string role
  9: string environment
  2: string jobName
  4: set<string> taskIds
  5: set<ScheduleStatus> statuses
  7: set<i32> instanceIds
  10: set<string> slaveHosts
  11: set<JobKey> jobKeys
  12: i32 offset
  13: i32 limit
  15: i32 limit_per_instance
}

but I'm less certain about the API here, so any help would be welcome.
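
For reference, the client-side filtering we do today looks roughly like the sketch below; it only uses getTasksStatus and the standard ScheduledTask fields, and is what a limit-per-instance style option would let the scheduler do for us:

from gen.apache.aurora.api.ttypes import TaskQuery

def latest_task_per_instance(client, role, env, name):
    # Pull every status ever recorded for the job, then keep only the task
    # with the most recent event per instance id.
    query = TaskQuery(role=role, environment=env, jobName=name)
    tasks = client.getTasksStatus(query).result.scheduleStatusResult.tasks
    latest = {}
    for task in tasks:
        instance = task.assignedTask.instanceId
        newest_event = max(e.timestamp for e in task.taskEvents)
        if instance not in latest or newest_event > latest[instance][0]:
            latest[instance] = (newest_event, task)
    return {i: t for i, (_, t) in latest.items()}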

All the changes we propose would be backward compatible.

-- 
-Igor
