I created two tickets to track the discussion there:
https://issues.apache.org/jira/browse/AURORA-1721
https://issues.apache.org/jira/browse/AURORA-1722
I'm willing to work on the rollback and potentially (depending on the result
of the discussion) on adding the TaskQuery flag.

Thanks,
-Igor

On Sun, Jun 19, 2016 at 8:24 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
>
> >> The next problem is related to the way we collect service cluster
> >> status. I couldn't find a way to quickly get the latest statuses for all
> >> instances/shards of a job in one query. Instead we query all task
> >> statuses for a job, then manually iterate through all the statuses and
> >> pick the latest one per instance id. For services with lots of churn on
> >> task statuses that means huge blobs of thrift transferred every time we
> >> issue a query. I was thinking of adding something along these lines:
> >
> > Does a TaskQuery filtering by job key and ACTIVE_STATES solve this? Still
> > includes the TaskConfig, but it's a single query, and probably rarely
> > exceeds 1 MB in response payload.
>
> We have a related problem, where we are interested in the status of the
> last executed cron job. Unfortunately, ACTIVE_STATES don't help here. One
> potential solution I have thought about is a flag in TaskQuery that enables
> server-side sorting of tasks by their latest event time. We could then
> query the status of the latest run by using this flag in combination with
> limit=1. This could also be composed with the limit_per_instance flag to
> cover the use case mentioned here.
>
> On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov <igm...@gmail.com> wrote:
> > Hi aurora people,
> >
> > I would like to start a discussion around a few things we would like to
> > see supported in the aurora scheduler. It is based on our experience of
> > integrating aurora into Uber's infrastructure, and I believe all the
> > items I'm going to talk about will benefit the community and people
> > running aurora clusters.
> >
> > 1. We support multiple aurora clusters in different failure domains, and
> > we run services in those domains. The upgrade workflow for those services
> > includes rolling out the same version of a service's software to all
> > aurora clusters concurrently while monitoring health status and other
> > service vitals, which includes checking error logs, service stats, and
> > downstream/upstream service health. That means we occasionally need to
> > manually trigger a rollback if things go south and roll back all the
> > update jobs in all aurora clusters for that particular service. So here
> > are the problems we discovered so far with this approach:
> >
> > - We don't have an easy way to assign a common unique identifier to all
> > JobUpdates in different aurora clusters in order to reconcile them later
> > into a single meta update job, so to speak. Instead we need to generate
> > that ID ourselves and keep it in every JobUpdate's metadata (via
> > JobUpdateRequest.taskConfig). Then, in order to get the status of the
> > upgrade workflow running in different data centers, we have to query all
> > recent job updates and, based on their metadata content, try to filter
> > the ones that we think belong to the currently running upgrade for the
> > service.
> >
> > We propose to change:
> >
> > struct JobUpdateRequest {
> >   /** Desired TaskConfig to apply. */
> >   1: TaskConfig taskConfig
> >
> >   /** Desired number of instances of the task config. */
> >   2: i32 instanceCount
> >
> >   /** Update settings and limits. */
> >   3: JobUpdateSettings settings
> >
> >   /** Optional job update key's id; if not specified, aurora will generate one. */
> >   4: optional string id
> > }
> >
> > There is potentially another, much more involved, solution of supporting
> > user-defined metadata, mentioned in this ticket:
> > https://issues.apache.org/jira/browse/AURORA-1711
> >
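> > To make the current workaround concrete, here is a rough client-side
> > sketch (plain dicts stand in for the real thrift structs, and the helper
> > and cluster names are made up). With the proposed field, the generated id
> > could instead be passed directly as the job update key's id in every
> > cluster and later queried back by key:
> >
> > import uuid
> >
> > CLUSTERS = ["dc1", "dc2"]  # hypothetical cluster names
> >
> > def build_update_requests(task_config, instance_count):
> >     """Stamp one shared upgrade id into every cluster's update metadata."""
> >     upgrade_id = str(uuid.uuid4())
> >     requests = {}
> >     for cluster in CLUSTERS:
> >         config = dict(task_config, metadata={"upgrade_id": upgrade_id})
> >         requests[cluster] = {"taskConfig": config,
> >                              "instanceCount": instance_count}
> >     return upgrade_id, requests
> >
> > def updates_for_upgrade(recent_updates, upgrade_id):
> >     """Filter recently queried updates down to the ones for this upgrade."""
> >     return [u for u in recent_updates
> >             if u.get("metadata", {}).get("upgrade_id") == upgrade_id]
> >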
> > - All that brings us to a second problem we had to deal with during the
> > upgrade: we don't have a good way to manually trigger a job update
> > rollback in aurora. The use case is again the same: while running
> > multiple update jobs in different aurora clusters, we have a real
> > production requirement to start rolling back update jobs if things are
> > misbehaving, and the nature of this misbehavior can potentially be very
> > complex. Currently we abort the job update and start a new one that
> > essentially rolls the cluster forward to a previously run version of the
> > software.
> >
> > We propose a new convenience API to roll back a running or completed
> > JobUpdate:
> >
> > /** Rollback job update. */
> > Response rollbackJobUpdate(
> >   /** The update to roll back. */
> >   1: JobUpdateKey key,
> >   /** A user-specified message to include with the induced job update state change. */
> >   3: string message)
> >
> > 2. The next problem is related to the way we collect service cluster
> > status. I couldn't find a way to quickly get the latest statuses for all
> > instances/shards of a job in one query. Instead we query all task
> > statuses for a job, then manually iterate through all the statuses and
> > pick the latest one per instance id. For services with lots of churn on
> > task statuses that means huge blobs of thrift transferred every time we
> > issue a query. I was thinking of adding something along these lines:
> >
> > struct TaskQuery {
> >   // TODO(maxim): Remove in 0.7.0. (AURORA-749)
> >   8: Identity owner
> >   14: string role
> >   9: string environment
> >   2: string jobName
> >   4: set<string> taskIds
> >   5: set<ScheduleStatus> statuses
> >   7: set<i32> instanceIds
> >   10: set<string> slaveHosts
> >   11: set<JobKey> jobKeys
> >   12: i32 offset
> >   13: i32 limit
> >   15: i32 limit_per_instance
> > }
> >
> > but I'm less certain about the API here, so any help would be welcome.
> >
> > All the changes we propose would be backward compatible.
> >
> > --
> > -Igor
>
--
-Igor
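To make the client-side reduction described under item 2 above concrete:
today the caller fetches every task for the job (including terminated runs)
and collapses them to the latest event per instance. Below is a rough sketch
of that reduction with made-up data shapes rather than the real thrift
structs; a server-side limit_per_instance (or the sort-by-latest-event-time
flag Stephan describes) would let the scheduler return only the last task
per instance instead.

# Illustration only: each task is a plain dict with an instance id and a
# list of (timestamp, status) events; this is not the real thrift API.

def latest_status_per_instance(tasks):
    """Reduce a full task history to the most recent status per instance."""
    latest = {}  # instance_id -> (event_timestamp, status)
    for task in tasks:
        instance_id = task["instanceId"]
        timestamp, status = max(task["events"])  # latest event of this task
        if instance_id not in latest or timestamp > latest[instance_id][0]:
            latest[instance_id] = (timestamp, status)
    return {i: status for i, (_, status) in latest.items()}

# Example: instance 0 has an old FAILED run and a newer RUNNING run.
tasks = [
    {"instanceId": 0, "events": [(100, "PENDING"), (110, "FAILED")]},
    {"instanceId": 0, "events": [(120, "PENDING"), (130, "RUNNING")]},
    {"instanceId": 1, "events": [(105, "PENDING"), (115, "RUNNING")]},
]
print(latest_status_per_instance(tasks))  # {0: 'RUNNING', 1: 'RUNNING'}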