Bhuvan Arumugam created AURORA-897:
--------------------------------------
Summary: aurora scheduler unable to restore from snapshot
Key: AURORA-897
URL: https://issues.apache.org/jira/browse/AURORA-897
Project: Aurora
Issue Type: Bug
Components: Scheduler
Affects Versions: 0.5.0
Reporter: Bhuvan Arumugam
Since this weekend, we are unable to upgrade scheduler. It fail as it couldn't
restore replicated log from snapshot. See following traceback for details.
I couldn't replicate it in vagrant that have >2 jobs. It happen in one of our
test clusters, running with 1 scheduler. I could reproduce this problem, with
restoring the snapshot manually using {{aurora_admin scheduler_stage_recovery}}
command. The backup size is 1.5M, if it matter. I can share the snapshot if
it'll help to debug.
Scheduler works as of this commit:
https://github.com/apache/incubator-aurora/commit/f2d0ca299ad759e9720a5a82547f7bce8a427d53
Scheduler DOES NOT work as of this commit:
https://github.com/apache/incubator-aurora/commit/3778330cd37bc8b96907654b496c15152dfecea7
Commit delta:
https://github.com/apache/incubator-aurora/compare/f2d0ca299ad759e9720a5a82547f7bce8a427d53...3778330cd37bc8b96907654b496c15152dfecea7
I suspect one of commits related to AURORA-84 have caused this regression. The
StorageBackfill class was modified to fix that jira.
{code}
I1025 16:07:31.336 THREAD1
org.apache.aurora.scheduler.storage.log.LogStorage.replay: Applying snapshot
taken on Sat Oct 25 16:05:29 UTC 2014
I1025 16:07:31.339 THREAD1
org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$9.execute: Restoring
snapshot.
I1025 16:07:31.891233 37970 leveldb.cpp:438] Reading position from leveldb took
254951ns
I1025 16:07:31.893616 37976 leveldb.cpp:438] Reading position from leveldb took
110726ns
I1025 16:07:31.909817 37971 leveldb.cpp:438] Reading position from leveldb took
186434ns
I1025 16:07:31.913910 37971 leveldb.cpp:438] Reading position from leveldb took
81185ns
I1025 16:07:31.917501 37983 leveldb.cpp:438] Reading position from leveldb took
31971ns
I1025 16:07:31.919 THREAD1
org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Backfilling task
config job keys.
I1025 16:07:32.023 THREAD1
org.apache.aurora.scheduler.storage.StorageBackfill.backfill: Performing shard
uniqueness sanity check.
E1025 16:07:32.076 THREAD1
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute: Caught unchecked
exception: java.lang.IllegalArgumentException
java.lang.IllegalArgumentException
at
com.google.common.base.Preconditions.checkArgument(Preconditions.java:111)
at
org.apache.aurora.scheduler.base.JobKeys.assertValid(JobKeys.java:100)
at
org.apache.aurora.scheduler.base.Query$Builder.byInstances(Query.java:337)
at org.apache.aurora.scheduler.base.Query.instanceScoped(Query.java:102)
at
org.apache.aurora.scheduler.storage.StorageBackfill.guaranteeShardUniqueness(StorageBackfill.java:74)
at
org.apache.aurora.scheduler.storage.StorageBackfill.access$100(StorageBackfill.java:44)
at
org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:158)
at
org.apache.aurora.scheduler.storage.StorageBackfill$2.apply(StorageBackfill.java:151)
at
org.apache.aurora.scheduler.storage.mem.MemTaskStore.mutateTasks(MemTaskStore.java:194)
at
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
at
org.apache.aurora.scheduler.storage.log.WriteAheadStorage.mutateTasks(WriteAheadStorage.java:196)
at
org.apache.aurora.scheduler.storage.StorageBackfill.backfill(StorageBackfill.java:151)
at
org.apache.aurora.scheduler.SchedulerLifecycle$6$1.execute(SchedulerLifecycle.java:251)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
at
org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:562)
at
org.apache.aurora.scheduler.storage.log.LogStorage$7.apply(LogStorage.java:559)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
at
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
at
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
at
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:559)
at
org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:331)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.doWork(MemStorage.java:175)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.access$400(MemStorage.java:59)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:200)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$4.apply(MemStorage.java:197)
at
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:150)
at
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:197)
at
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:548)
at
org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:320)
at
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
at
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:248)
at
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:244)
at com.twitter.common.base.Closures$4.execute(Closures.java:120)
at
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:423)
at
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:419)
at com.twitter.common.base.Closures$4.execute(Closures.java:120)
at com.twitter.common.base.Closures$3.execute(Closures.java:98)
at
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
at
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:465)
at
com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
at
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
at
com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
at
com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
at
com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
at com.twitter.common.zookeeper.Group$4.get(Group.java:551)
at com.twitter.common.zookeeper.Group$4.get(Group.java:548)
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
at
com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
at
com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
at com.twitter.common.zookeeper.Group.watch(Group.java:548)
at
com.twitter.common.zookeeper.CandidateImpl.offerLeadership(CandidateImpl.java:136)
at
com.twitter.common.zookeeper.SingletonService.lead(SingletonService.java:165)
at
org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:253)
at com.twitter.common.application.AppLauncher.run(AppLauncher.java:102)
at
com.twitter.common.application.AppLauncher.launch(AppLauncher.java:181)
at
com.twitter.common.application.AppLauncher.launch(AppLauncher.java:142)
at
org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:269)
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)