> We don't have an easy way to assign a common unique identifier for all
> JobUpdates in different aurora clusters in order to reconcile them later
> into a single meta update job, so to speak. Instead we need to generate
> that ID and keep it in every aurora's JobUpdate metadata
> (JobUpdateRequest.taskConfig). Then, in order to get the status of the
> upgrade workflow running in different data centers, we have to query all
> recent jobs and, based on their metadata content, try to filter in the
> ones that we think belong to a currently running upgrade for the service.

Can you elaborate on the shortcoming of using TaskConfig.metadata? From a
quick read, it seems like your proposal does with an explicit field what you
can accomplish with the more versatile metadata field. For example, you could
store a git commit SHA in TaskConfig.metadata, and identify the commit in use
by each instance as well as track the revision changes when a job is updated.
However, I feel like I may be missing some context, as "query all recent
jobs" sounds like a broader query scope than I would expect.
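To make the comparison concrete, here is roughly what I had in mind, as seen
from a multi-cluster upgrade coordinator. It's only an untested sketch: I'm
assuming Thrift-generated Python bindings for api.thrift (the
gen.apache.aurora.api module path is a placeholder), and the helper names and
metadata key are mine, not anything that exists today.

import uuid

# Assumed Thrift-generated bindings for api.thrift; the module path is a placeholder.
from gen.apache.aurora.api.ttypes import Metadata

UPGRADE_ID_KEY = 'upgrade_id'  # hypothetical metadata key chosen by the coordinator

def new_upgrade_id():
    """One identifier shared by the JobUpdates started in every aurora cluster."""
    return str(uuid.uuid4())

def tag_task_config(task_config, upgrade_id):
    """Attach the shared upgrade id to the TaskConfig sent in each cluster's JobUpdateRequest."""
    existing = set(task_config.metadata or [])
    task_config.metadata = existing | {Metadata(key=UPGRADE_ID_KEY, value=upgrade_id)}
    return task_config

def belongs_to_upgrade(update_details, upgrade_id):
    """Client-side filtering over recent updates -- the step an explicit id field would avoid."""
    task = update_details.update.instructions.desiredState.task
    return any(m.key == UPGRADE_ID_KEY and m.value == upgrade_id
               for m in (task.metadata or []))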
> We propose a new convenience API to roll back a running or complete
> JobUpdate:
>
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>     /** The update to rollback. */
>     1: JobUpdateKey key,
>     /** A user-specified message to include with the induced job update
>         state change. */
>     3: string message)

I think this is a great idea! It's something I've thought about for a while,
but haven't really had the personal need.

> The next problem is related to the way we collect service cluster status.
> I couldn't find a way to quickly get the latest statuses for all
> instances/shards of a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and filter out
> the latest one, grouped by instance id. For services with lots of churn on
> task statuses that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:

Does a TaskQuery filtering by job key and ACTIVE_STATES solve this? It still
includes the TaskConfig, but it's a single query, and probably rarely exceeds
1 MB in response payload.
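Concretely, I was imagining a single per-cluster query along these lines.
Again an untested sketch assuming the Thrift-generated Python bindings for
api.thrift; the module paths and the client construction are placeholders.

# Assumed Thrift-generated bindings for api.thrift; module paths are placeholders.
from gen.apache.aurora.api.ttypes import JobKey, TaskQuery
from gen.apache.aurora.api.constants import ACTIVE_STATES

def active_tasks_for_job(client, role, environment, name):
    """One query per job: only tasks currently in an active state come back."""
    query = TaskQuery(
        jobKeys={JobKey(role=role, environment=environment, name=name)},
        statuses=ACTIVE_STATES)
    # getTasksStatus returns full ScheduledTask structs (TaskConfig included);
    # getTasksWithoutConfigs should trim the payload further if the configs
    # aren't needed.
    response = client.getTasksStatus(query)
    return response.result.scheduleStatusResult.tasks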
On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov <igm...@gmail.com> wrote:

> Hi aurora people,
>
> I would like to start a discussion around a few things we would like to
> see supported in the aurora scheduler. It is based on our experience of
> integrating aurora into Uber's infrastructure, and I believe all the items
> I'm going to talk about will benefit the community and people running
> aurora clusters.
>
> 1. We support multiple aurora clusters in different failure domains and we
> run services in those domains. The upgrade workflow for those services
> includes rolling out the same version of a service's software to all
> aurora clusters concurrently while monitoring the health status and other
> service vitals, which includes things like checking error logs, service
> stats, and downstream/upstream services' health. That means we
> occasionally need to manually trigger a rollback if things go south, and
> roll back all the update jobs in all aurora clusters for that particular
> service. So here are the problems we have discovered so far with this
> approach:
>
>   - We don't have an easy way to assign a common unique identifier for all
>     JobUpdates in different aurora clusters in order to reconcile them
>     later into a single meta update job, so to speak. Instead we need to
>     generate that ID and keep it in every aurora's JobUpdate metadata
>     (JobUpdateRequest.taskConfig). Then, in order to get the status of the
>     upgrade workflow running in different data centers, we have to query
>     all recent jobs and, based on their metadata content, try to filter in
>     the ones that we think belong to a currently running upgrade for the
>     service.
>
> We propose to change
>
>   struct JobUpdateRequest {
>     /** Desired TaskConfig to apply. */
>     1: TaskConfig taskConfig
>
>     /** Desired number of instances of the task config. */
>     2: i32 instanceCount
>
>     /** Update settings and limits. */
>     3: JobUpdateSettings settings
>
>     /** Optional Job Update key's id; if not specified aurora will generate one. */
>     4: optional string id
>   }
>
> There is potentially another, much more involved, solution of supporting
> user-defined metadata mentioned in this ticket:
> https://issues.apache.org/jira/browse/AURORA-1711
>
>   - All that brings us to a second problem we had to deal with during the
>     upgrade: we don't have a good way to manually trigger a job update
>     rollback in aurora. The use case is again the same: while running
>     multiple update jobs in different aurora clusters, we have a real
>     production requirement to start rolling back update jobs if things are
>     misbehaving, and the nature of this misbehavior could potentially be
>     very complex. Currently we abort the job update and start a new one
>     that would essentially roll the cluster forward to a previously run
>     version of the software.
>
> We propose a new convenience API to roll back a running or complete
> JobUpdate:
>
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>     /** The update to rollback. */
>     1: JobUpdateKey key,
>     /** A user-specified message to include with the induced job update
>         state change. */
>     3: string message)
>
> 2. The next problem is related to the way we collect service cluster
> status. I couldn't find a way to quickly get the latest statuses for all
> instances/shards of a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and filter out
> the latest one, grouped by instance id. For services with lots of churn on
> task statuses that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:
>
>   struct TaskQuery {
>     // TODO(maxim): Remove in 0.7.0. (AURORA-749)
>     8: Identity owner
>     14: string role
>     9: string environment
>     2: string jobName
>     4: set<string> taskIds
>     5: set<ScheduleStatus> statuses
>     7: set<i32> instanceIds
>     10: set<string> slaveHosts
>     11: set<JobKey> jobKeys
>     12: i32 offset
>     13: i32 limit
>     14: i32 limit_per_instance
>   }
>
> but I'm less certain on the API here, so any help would be welcome.
>
> All the changes we propose would be backward compatible.
>
> --
> -Igor
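P.S. On the rollback API: from the multi-cluster coordinator's side, I'd
picture driving it roughly like the sketch below. This is purely hypothetical
code, since rollbackJobUpdate is only proposed above, and the per-cluster
client plumbing is a placeholder.

# Hypothetical fan-out over the proposed rollbackJobUpdate RPC; nothing here
# exists in aurora today.
from gen.apache.aurora.api.ttypes import ResponseCode  # assumed generated bindings

def rollback_everywhere(clients_by_cluster, update_keys_by_cluster, reason):
    """Ask every cluster to roll back its part of the cross-cluster upgrade.

    clients_by_cluster: cluster name -> scheduler client (placeholder plumbing).
    update_keys_by_cluster: cluster name -> JobUpdateKey of the update started there.
    """
    failures = {}
    for cluster, client in clients_by_cluster.items():
        # Proposed RPC, not yet implemented: rollbackJobUpdate(JobUpdateKey key, string message)
        response = client.rollbackJobUpdate(update_keys_by_cluster[cluster], reason)
        if response.responseCode != ResponseCode.OK:
            failures[cluster] = response
    return failures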