[ https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Renan DelValle updated AURORA-1780:
-----------------------------------
    Summary: Offers with unknown resources types to Aurora crash the scheduler  (was: Offers with unknown resources to Aurora crash the scheduler)

> Offers with unknown resources types to Aurora crash the scheduler
> -----------------------------------------------------------------
>
>                 Key: AURORA-1780
>                 URL: https://issues.apache.org/jira/browse/AURORA-1780
>             Project: Aurora
>          Issue Type: Bug
>         Environment: vagrant
>            Reporter: Renan DelValle
>
> Taking offers from Agents which have resources that are not known to Aurora causes the Scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:20000;gpus(*):4;test:200" | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora:
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597 2999 log.cpp:577] Attempting to append 109 bytes to the log
> I0922 02:42:30.585654 2999 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 4
> I0922 02:42:30.585747 2999 replica.cpp:537] Replica received write request for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858 2999 leveldb.cpp:341] Persisting action (125 bytes) to leveldb took 1.086601ms
> I0922 02:42:30.586897 2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020 2999 replica.cpp:691] Replica received learned notice for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785 2999 leveldb.cpp:341] Persisting action (127 bytes) to leveldb took 746999ns
> I0922 02:42:30.587805 2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811 2999 replica.cpp:697] Replica learned APPEND action at
> position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>     at java.util.Objects.requireNonNull(Objects.java:228)
>     at org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>     at org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>     at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>     at java.util.Iterator.forEachRemaining(Iterator.java:115)
>     at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>     at org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>     at org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>     at org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>     at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>     at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> E0922 02:42:38.353 [SlotSizeCounterService RUNNING, GuavaUtils$LifecycleShutdownListener:55] Service: SlotSizeCounterService [FAILED] failed unexpectedly. Triggering shutdown.
> I0922 02:42:38.353 [SlotSizeCounterService RUNNING, Lifecycle:84] Shutting down application
> I0922 02:42:38.354 [SlotSizeCounterService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77] Executing 4 shutdown commands.
> I0922 02:42:38.356 [SlotSizeCounterService RUNNING, StateMachine$Builder:389] SchedulerLifecycle state machine transition ACTIVE -> DEAD
> I0922 02:42:38.373028 4029 sched.cpp:1987] Asked to stop the driver
> I0922 02:42:38.373152 3000 sched.cpp:1187] Stopping framework 'cadaf569-171d-42fc-a417-fbd608ea5bab-0000'
> I0922 02:42:38.373 [BlockingDriverJoin, SchedulerLifecycle$6:267] Driver exited, terminating lifecycle.
> I0922 02:42:38.374 [BlockingDriverJoin, StateMachine$Builder:389] SchedulerLifecycle state machine transition DEAD -> DEAD
> I0922 02:42:38.374 [BlockingDriverJoin, SchedulerLifecycle$7:287] Shutdown already invoked, ignoring extra call.
> I0922 02:42:38.375 [SlotSizeCounterService RUNNING, StateMachine$Builder:389] storage state machine transition READY -> STOPPED
> I0922 02:42:38.392 [CronLifecycle STOPPING, CronLifecycle:90] Shutting down Quartz cron scheduler.
> I0922 02:42:38.392 [CronLifecycle STOPPING, QuartzScheduler:694] Scheduler QuartzScheduler_$_aurora-cron-1 shutting down.
> I0922 02:42:38.392 [CronLifecycle STOPPING, QuartzScheduler:613] Scheduler QuartzScheduler_$_aurora-cron-1 paused.
> I0922 02:42:38.394 [CronLifecycle STOPPING, QuartzScheduler:771] Scheduler QuartzScheduler_$_aurora-cron-1 shutdown complete.
> W0922 02:42:43.450 [SlotSizeCounterService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:87] Shutdown action failed.
> java.util.concurrent.TimeoutException: Timeout waiting for the services to stop. The following services have not stopped: {STOPPING=[TaskGroupBatchWorker [STOPPING], TaskEventBatchWorker [STOPPING], CronBatchWorker [STOPPING]]}
>     at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitStopped(ServiceManager.java:571) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ServiceManager.awaitStopped(ServiceManager.java:349) ~[guava-19.0.jar:na]
>     at org.apache.aurora.GuavaUtils$1.awaitStopped(GuavaUtils.java:139) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.SchedulerLifecycle$3.execute(SchedulerLifecycle.java:221) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:85) ~[commons-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.common.application.Lifecycle.shutdown(Lifecycle.java:85) [commons-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.GuavaUtils$LifecycleShutdownListener.failure(GuavaUtils.java:56) [aurora-0.17.0-SNAPSHOT.jar:na]
>     at com.google.common.util.concurrent.ServiceManager$ServiceManagerState$2.call(ServiceManager.java:695) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ServiceManager$ServiceManagerState$2.call(ServiceManager.java:693) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ListenerCallQueue.run(ListenerCallQueue.java:118) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ListenerCallQueue.execute(ListenerCallQueue.java:86) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.executeListeners(ServiceManager.java:706) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.transitionService(ServiceManager.java:677) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ServiceManager$ServiceListener.failed(ServiceManager.java:781) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.AbstractService$5.call(AbstractService.java:509) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.AbstractService$5.call(AbstractService.java:507) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ListenerCallQueue.run(ListenerCallQueue.java:118) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.ListenerCallQueue.execute(ListenerCallQueue.java:86) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.AbstractService.executeListeners(AbstractService.java:458) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.AbstractService.notifyFailed(AbstractService.java:407) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:197) [guava-19.0.jar:na]
>     at com.google.common.util.concurrent.Callables$3.run(Callables.java:100) [guava-19.0.jar:na]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> I0922 02:42:43.453 [Curator-Framework-0, CuratorFrameworkImpl:821] backgroundOperationsLoop exiting
> I0922 02:42:43.458 [main-EventThread, ClientCnxn$EventThread:519] EventThread shut down for session: 0x1574fa2880c000a
> I0922 02:42:43.458 [SlotSizeCounterService RUNNING, ZooKeeper:684] Session: 0x1574fa2880c000a closed
> I0922 02:42:43.459 [main, SchedulerMain:101] Stopping scheduler services.
> I0922 02:42:43.459 [TimeSeriesRepositoryImpl STOPPING, TimeSeriesRepositoryImpl:168] Variable sampler shut down
> I0922 02:42:43.462 [TearDownShutdownRegistry STOPPING, ShutdownRegistry$ShutdownRegistryImpl:95] Action controller has already completed, subsequent calls ignored.
> I0922 02:42:43.464 [HttpServerLauncher STOPPING, JettyServerModule$HttpServerLauncher:413] Shutting down embedded http server
> I0922 02:42:43.470 [HttpServerLauncher STOPPING, AbstractConnector:310] Stopped ServerConnector@5c738e5{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
> I0922 02:42:43.473 [HttpServerLauncher STOPPING, ContextHandler:910] Stopped o.e.j.s.ServletContextHandler@5edf9dde{/,null,UNAVAILABLE}
> I0922 02:42:43.476 [main, SchedulerMain:187] Application run() exited.
> 2016-09-22 02:42:58,164:2942(0x7f307f9b5700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 11ms
> E0922 02:43:38.369 [SlaStat-0, AsyncUtil:159] java.util.concurrent.ExecutionException: org.apache.aurora.scheduler.storage.Storage$TransientStorageException: Storage is not READY
> java.util.concurrent.ExecutionException: org.apache.aurora.scheduler.storage.Storage$TransientStorageException: Storage is not READY
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.8.0_91]
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_91]
>     at org.apache.aurora.scheduler.base.AsyncUtil.evaluateResult(AsyncUtil.java:154) [aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.base.AsyncUtil.access$000(AsyncUtil.java:35) [aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.base.AsyncUtil$1.afterExecute(AsyncUtil.java:65) [aurora-0.17.0-SNAPSHOT.jar:na]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150) [na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: org.apache.aurora.scheduler.storage.Storage$TransientStorageException: Storage is not READY
>     at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.checkInState(CallOrderEnforcingStorage.java:79) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:112) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.storage.Storage$Util.fetchTasks(Storage.java:299) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.scheduler.sla.MetricCalculator.run(MetricCalculator.java:183) ~[aurora-0.17.0-SNAPSHOT.jar:na]
>     at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-0.17.0-SNAPSHOT.jar:na]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_91]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
>     ... 2 common frames omitted
> {code}
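
The first trace shows the NullPointerException being raised by Objects.requireNonNull inside ResourceType.fromResource while a stream of offer resources is collected, for the resource named "test". The sketch below reproduces that failure mode in isolation and shows one possible mitigation, which is to skip resource names the scheduler does not recognize. It is an illustration only and not Aurora's code: the class UnknownResourceSketch, the enum KnownType, the BY_NAME table, and the methods strictLookup, strictTypes, and tolerantTypes are all hypothetical names, and whether skipping unknown resources is the right behaviour for Aurora is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical, self-contained sketch; none of these names come from Aurora.
class UnknownResourceSketch {
  // Stand-in for the resource types a scheduler understands.
  enum KnownType { CPUS, MEM, DISK, GPUS }

  // Name-to-type table. "test" from the reproduction steps is deliberately absent.
  static final Map<String, KnownType> BY_NAME = new HashMap<>();
  static {
    BY_NAME.put("cpus", KnownType.CPUS);
    BY_NAME.put("mem", KnownType.MEM);
    BY_NAME.put("disk", KnownType.DISK);
    BY_NAME.put("gpus", KnownType.GPUS);
  }

  // Strict lookup: an unrecognized name makes requireNonNull throw
  // NullPointerException, the same exception type seen in the trace.
  static KnownType strictLookup(String name) {
    return Objects.requireNonNull(BY_NAME.get(name), "Unknown Mesos resource: " + name);
  }

  // Mapping every offered name strictly: one unknown name fails the whole offer.
  static List<KnownType> strictTypes(List<String> offeredNames) {
    return offeredNames.stream()
        .map(UnknownResourceSketch::strictLookup)
        .collect(Collectors.toList());
  }

  // Tolerant variant (assumed mitigation): drop unknown names before mapping.
  static List<KnownType> tolerantTypes(List<String> offeredNames) {
    return offeredNames.stream()
        .filter(BY_NAME::containsKey)
        .map(BY_NAME::get)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Resource names taken from the reproduction steps in this report.
    List<String> offered = Arrays.asList("cpus", "mem", "disk", "gpus", "test");
    System.out.println("tolerant: " + tolerantTypes(offered));
    try {
      strictTypes(offered);
    } catch (NullPointerException e) {
      System.out.println("strict failed: " + e.getMessage());
    }
  }
}
```

With the resource names from the reproduction steps, the strict variant fails on "test" while the tolerant variant keeps the four known types. A real change would also need to decide whether the ignored resource should be logged or counted.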