[
https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998976#comment-13998976
]
Bhuvan Arumugam commented on AURORA-420:
----------------------------------------
Looks like we should still use {{mesos-log}} for replication, going by the
{{vagrant/provision-dev-cluster.sh}} script. The problem is more likely that the
{{mesos-log}} command in v0.19.0 is unable to parse a replica created by an
older version, in this case v0.18.0.
> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
> Key: AURORA-420
> URL: https://issues.apache.org/jira/browse/AURORA-420
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.6.0
> Reporter: Bhuvan Arumugam
>
> We are using the latest as of
> https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
> Technically it's 0.5.1-snapshot.
> The scheduler seems to crash due to corrupt data in the replica. It has
> crashed twice in the last two days. Here is the log snippet.
> The last time we started the scheduler after a similar crash, all jobs were
> lost. We were running around 30 apps on different slaves during the crash.
> The apps are still running on the slaves, and the slaves are shown as
> running in the master UI. The scheduler seems to have trouble reconnecting
> to the running tasks when it comes back online. FWIW, we are not using
> checkpointing.
> Can you let me know:
> 1. how to prevent the crashes?
> 2. how to recover jobs from the replica backup?
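> For question 2, one possible path (a sketch, assuming your build includes
> the scheduler's backup/recovery admin commands and that periodic backups
> were enabled via {{-backup_dir}}; cluster name and backup id below are
> placeholders):
> {code}
> # List available on-disk backups on the leading scheduler (placeholder ids).
> aurora_admin scheduler_list_backups CLUSTER
>
> # Stage a chosen backup for recovery, inspect it, then commit it.
> aurora_admin scheduler_stage_recovery CLUSTER scheduler-backup-2014-05-13
> aurora_admin scheduler_commit_recovery CLUSTER
> {code}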
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to
> the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request
> for position 29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to
> leveldb took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice
> for position 29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to
> leveldb took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at
> position 29779
> I0513 22:07:46.621 THREAD5299
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering automatic
> failover.
> I0513 22:10:20.475 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework
> '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine
> transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24
> com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
> server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29
> com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0513 22:10:20.487 THREAD31
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24
> org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No
> schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
> Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24
> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: All candidates
> have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
> Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
> Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
> com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
> com.twitter.common.base.Closures$4.execute(Closures.java:120)
> com.twitter.common.base.Closures$3.execute(Closures.java:98)
> com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29
> com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler
> shut down
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute: Stopping
> thrift server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.shutdown: Received shutdown
> request, stopping server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29
> com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute:
> Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped
> [email protected]:8081
> I0513 22:10:20.594 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run:
> Application run() exited.
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)