The query performance improved drastically: it now takes only 29 ms for 12K jobs / 30K tasks (down from an hour!).
Thanks Maxim for the quick lead, really appreciate your help.

Thanks,
Sham

> On Jun 9, 2016, at 10:06 AM, Maxim Khutornenko <ma...@apache.org> wrote:
>
> Scheduler persists its state in the Mesos replicated log regardless of
> the in-memory engine. If you change the flag and restart the scheduler,
> all tasks are going to be re-inserted into MemTaskStore instead of
> DBTaskStore. No data will be lost.
>
> On Thu, Jun 9, 2016 at 9:55 AM, Shyam Patel <sham.pate...@gmail.com> wrote:
>> Thanks Maxim,
>>
>> If we move to the mem task store, would a restart of Aurora lose the data?
>> (BTW, I'm running Aurora in a container.)
>>
>>> On Jun 9, 2016, at 8:37 AM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>
>>> There are plenty of factors that may contribute to the behavior
>>> you're observing. Based on the logs, though, it appears you are using
>>> DBTaskStore (-use_beta_db_task_store=true)? If so, you may want to
>>> revert to the default in-mem task store
>>> (-use_beta_db_task_store=false), as DBTaskStore is known to perform
>>> subpar at large task counts. This is a known issue and we plan to
>>> invest in making it faster.
>>>
>>> On Thu, Jun 9, 2016 at 6:58 AM, Erb, Stephan
>>> <stephan....@blue-yonder.com> wrote:
>>>> I am no expert here, but I would assume that slow task store operations
>>>> could result from a slow replicated log. Have you tried keeping it on an
>>>> SSD?
>>>> (https://github.com/apache/aurora/blob/e89521f1eebd9a5301eb02e2ed6ffebdecd54c9a/docs/operations/configuration.md#-native_log_file_path)
>>>>
>>>> FWIW, there was a recent RB by Maxim to reduce master load under task
>>>> reconciliation: https://reviews.apache.org/r/47373/diff/2#index_header
>>>> ________________________________________
>>>> From: Shyam Patel <sham.pate...@gmail.com>
>>>> Sent: Thursday, June 9, 2016 07:48
>>>> To: dev@aurora.apache.org
>>>> Subject: Re: Aurora performance impact with hourly query runs
>>>>
>>>> Hi Bill,
>>>>
>>>> Cluster setup: AWS
>>>>
>>>> 1 Mesos, 1 ZK, 1 Aurora instance: 4 CPU, 16G mem
>>>>
>>>> Aurora: Xmx 14G
>>>>
>>>> 100-node agent cluster: 40 CPU, 160G mem each
>>>>
>>>> 8000 jobs, each with 2 instances. So, ~16K containers total.
>>>>
>>>> Thanks,
>>>> Sham
>>>>
>>>>> On Jun 8, 2016, at 9:18 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>> Can you give some insight into the machine specs and JVM options used?
>>>>>
>>>>> Also, is it 8000 jobs or tasks? The terms are often mixed up, but it
>>>>> will make a big difference here.
>>>>>
>>>>> On Wednesday, June 8, 2016, Shyam Patel <sham.pate...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> While running LnP testing, I'm spinning off 8K Docker jobs. During the
>>>>>> run, I ran into an issue where TaskStatUpdater and TaskReconciler
>>>>>> queries take a really long time. During that time, Aurora is pretty
>>>>>> much freezing and at some point dying. I also tried the same run
>>>>>> without the Docker jobs and hit the same issue.
>>>>>>
>>>>>> Is there a way to keep Aurora's performance intact during the query
>>>>>> runs?
>>>>>>
>>>>>> Here is a snippet from the log:
>>>>>>
>>>>>> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query
>>>>>> took 1243517 ms: TaskQuery(owner:null, role:null, environment:null,
>>>>>> jobName:null, taskIds:null, statuses:[STARTING, THROTTLED, RUNNING,
>>>>>> DRAINING, ASSIGNED, KILLING, RESTARTING, PENDING, PREEMPTING],
>>>>>> instanceIds:null, slaveHosts:null, jobKeys:null, offset:0, limit:0)
>>>>>>
>>>>>> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took 1380169
>>>>>> ms: TaskQuery(owner:null, role:null, environment:null, jobName:null,
>>>>>> taskIds:null, statuses:[STARTING, RUNNING, DRAINING, ASSIGNED, KILLING,
>>>>>> RESTARTING, PREEMPTING], instanceIds:null, slaveHosts:null, jobKeys:null,
>>>>>> offset:0, limit:0)
>>>>>>
>>>>>> Appreciate any insights.
>>>>>>
>>>>>> Thanks,
>>>>>> Sham
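
For anyone debugging a similar slowdown, a quick way to confirm that the task
store is the culprit is to pull the slow-query lines (like the two quoted
above) out of the scheduler log. A minimal sketch, assuming the log is written
to /var/log/aurora/scheduler.log (the path is an assumption and depends on how
the scheduler is launched):

    # List the 20 slowest DbTaskStore queries, in milliseconds.
    # Matches the "Query took <N> ms: TaskQuery(...)" lines quoted above;
    # the log path is a placeholder.
    grep 'DbTaskStore.*Query took' /var/log/aurora/scheduler.log \
      | awk '{ for (i = 1; i <= NF; i++) if ($i == "took") print $(i + 1) }' \
      | sort -n | tail -20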
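
And for completeness, a minimal sketch of a scheduler launch wrapper that
combines the two settings discussed in the thread: reverting to the in-memory
task store and keeping the replicated log on fast storage. The flag names come
from the thread and the linked configuration doc; the binary path, cluster
name, and log location are placeholders, and the other flags a real deployment
needs (ZooKeeper endpoints, backup dir, executor settings, and so on) are
omitted:

    #!/bin/sh
    # Illustrative wrapper only; paths and cluster name are placeholders.
    #
    # -use_beta_db_task_store=false reverts from DBTaskStore to the default
    # in-memory MemTaskStore. Scheduler state lives in the Mesos replicated
    # log, so nothing is lost across the restart.
    # -native_log_file_path keeps the replicated log on an SSD-backed volume,
    # as suggested earlier in the thread.
    exec /usr/sbin/aurora-scheduler \
      -cluster_name=example \
      -use_beta_db_task_store=false \
      -native_log_file_path=/ssd/aurora/scheduler/db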