Many factors could contribute to the behavior you're observing. Based on the logs, though, it looks like you are using DBTaskStore (-use_beta_db_task_store=true), is that right? If so, you may want to revert to the default in-memory task store (-use_beta_db_task_store=false), as DBTaskStore is known to perform poorly with large task counts. This is a known issue, and we plan to invest in making it faster.
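If it helps to compare query times before and after switching task stores, here is a rough Python sketch that pulls the slow DbTaskStore queries out of a scheduler log. It assumes the log lines have the same shape as the two lines in the snippet quoted below; the function name `slow_queries` and the 1000 ms default threshold are hypothetical choices for illustration, not anything Aurora provides.

```python
import re

# Assumed line shape, based only on the log snippet quoted in this thread:
#   I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query took 1243517 ms: TaskQuery(...)
SLOW_QUERY = re.compile(
    r"\[(?P<thread>[^\]]*?), DbTaskStore:\d+\] Query took (?P<ms>\d+) ms"
)


def slow_queries(lines, threshold_ms=1000):
    """Yield (thread, milliseconds) for DbTaskStore queries at or above threshold_ms.

    `threshold_ms` is an arbitrary cut-off chosen for this sketch.
    """
    for line in lines:
        match = SLOW_QUERY.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            yield match.group("thread"), int(match.group("ms"))
```

Passing an open log file (or any iterable of lines) to `slow_queries` gives one tuple per slow query, which should make it easy to see whether the multi-minute queries disappear with the in-memory store.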

On Thu, Jun 9, 2016 at 6:58 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
> I am no expert here, but I would assume that slow task store operations could
> result from a slow replicated log. Have you tried keeping it on an SSD?
> (https://github.com/apache/aurora/blob/e89521f1eebd9a5301eb02e2ed6ffebdecd54c9a/docs/operations/configuration.md#-native_log_file_path)
>
> FWIW, there was a recent RB by Maxim to reduce Master load unter task
> reconciliation: https://reviews.apache.org/r/47373/diff/2#index_header
> ________________________________________
> From: Shyam Patel <sham.pate...@gmail.com>
> Sent: Thursday, June 9, 2016 07:48
> To: dev@aurora.apache.org
> Subject: Re: Aurora performance impact with hourly query runs
>
> Hi Bill,
>
> Cluster Set up : AWS
>
> 1 Mesos , 1 ZK , 1 Aurora instance : 4 CPU, 16G mem
>
> Aurora : Xmx 14G
>
> 100 nodes agent cluster : 40 CPU, 160G mem each
>
> 8000 Jobs, each with 2 instances. So, total ~16K containers
>
>
> Thanks,
> Sham
>
>
>
>> On Jun 8, 2016, at 9:18 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>> Can you give some insight into the machine specs and JVM options used?
>>
>> Also, is it 8000 jobs or tasks? The terms are often mixed up, but will
>> have a big difference here.
>>
>> On Wednesday, June 8, 2016, Shyam Patel <sham.pate...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> While running LnP testing, I’m spinning of 8K docker jobs. During the run,
>>> I ran into issue where TaskStatUpdate and TaskReconciler queries taking
>>> real long times. During the time, Aurora is pretty much freezing and at a
>>> point dying. Also, tried the same run w/o the docker jobs and faced the
>>> same issue.
>>>
>>>
>>> Is there a way to keep the Aurora performance intact during the query runs
>>> ?
>>>
>>>
>>>
>>> Here is snipped from log :
>>>
>>>
>>> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query
>>> took 1243517 ms: TaskQuery(owner:null, role:null, environment:null,
>>> jobName:null, taskIds:null, statuses:[STARTING, THROTTLED, RUNNING,
>>> DRAINING, ASSIGNED, KILLING, RESTARTING, PENDING, PREEMPTING],
>>> instanceIds:null, slaveHosts:null, jobKeys:null, offset:0, limit:0)
>>>
>>>
>>> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took 1380169
>>> ms: TaskQuery(owner:null, role:null, environment:null, jobName:null,
>>> taskIds:null, statuses:[STARTING, RUNNING, DRAINING, ASSIGNED, KILLING,
>>> RESTARTING, PREEMPTING], instanceIds:null, slaveHosts:null, jobKeys:null,
>>> offset:0, limit:0)
>>>
>>>
>>>
>>> Appreciate any insights..
>>>
>>>
>>> Thanks,
>>> Sham
>>>