Bhuvan Arumugam created AURORA-420:
--------------------------------------

             Summary: scheduler crash due to corrupt replica data?
                 Key: AURORA-420
                 URL: https://issues.apache.org/jira/browse/AURORA-420
             Project: Aurora
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 0.6.0
            Reporter: Bhuvan Arumugam


We are running the latest code as of 
https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
Technically it's 0.5.1-snapshot.

The scheduler seems to crash due to corrupt data in the replica. It has crashed 
twice in the last 2 days. The log snippet is below.

The last time we started the scheduler after a similar crash, all jobs were 
lost. We were running around 30 apps on different slaves during the crash. The 
apps are still running on the slaves, though, and the slaves are shown as 
running in the master UI. The scheduler seems to have trouble reconnecting to 
the running tasks when it comes back online. FWIW, we are not using 
checkpointing.

Can you let me know:
  1. How to prevent the crashes?
  2. How to recover jobs from the replica backup?

{code}
I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to the 
log
I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to 
write APPEND action at position 29779
I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request for 
position 29779
I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to 
leveldb took 3.177192ms
I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice 
for position 29779
I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to 
leveldb took 2.637372ms
I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at 
position 29779
I0513 22:07:46.621 THREAD5299 
org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer: Returning 
offers for 20140512-151150-360689681-5050-7152-6 for compaction.
I0513 22:08:39.641 THREAD5301 
org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer: Returning 
offers for 20140512-151150-360689681-5050-7152-9 for compaction.
I0513 22:10:20.474 THREAD29 
org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering automatic 
failover.
I0513 22:10:20.475 THREAD29 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition ACTIVE -> DEAD
I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework 
'2014-03-26-13:02:35-360689681-5050-31080-0000'
I0513 22:10:20.486 THREAD29 
com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine 
transition READY -> STOPPED
W0513 22:10:20.486 THREAD24 
com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
 server set empty for path /aurora/scheduler
I0513 22:10:20.486 THREAD31 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition DEAD -> DEAD
I0513 22:10:20.486 THREAD29 com.twitter.common.application.Lifecycle.shutdown: 
Shutting down application
I0513 22:10:20.487 THREAD31 
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already 
invoked, ignoring extra call.
W0513 22:10:20.486 THREAD24 
org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No 
schedulers in host set, will not redirect despite not being leader.
I0513 22:10:20.487 THREAD29 
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: 
Executing 8 shutdown commands.
W0513 22:10:20.488 THREAD24 
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: All candidates have 
temporarily left the group: Group /aurora/scheduler
E0513 22:10:20.488 THREAD24 
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
 Lost leadership, committing suicide.
I0513 22:10:20.489 THREAD24 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition DEAD -> DEAD
I0513 22:10:20.489 THREAD24 
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already 
invoked, ignoring extra call.
I0513 22:10:20.491 THREAD29 
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
 Shutdown initiated by: Thread: Lifecycle-0 (id 29)
java.lang.Thread.getStackTrace(Thread.java:1588)
  org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
  com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
  com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
  org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
  org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
  com.twitter.common.base.Closures$4.execute(Closures.java:120)
  com.twitter.common.base.Closures$3.execute(Closures.java:98)
  com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
  org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
  java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  java.util.concurrent.FutureTask.run(FutureTask.java:262)
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744)
I0513 22:10:20.491 THREAD29 
com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler 
shut down
I0513 22:10:20.491 THREAD29 
org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute: Stopping 
thrift server.
I0513 22:10:20.491 THREAD29 
org.apache.aurora.scheduler.thrift.ThriftServer.shutdown: Received shutdown 
request, stopping server.
I0513 22:10:20.491 THREAD29 
org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status 
ALIVE to STOPPING
I0513 22:10:20.492 THREAD29 
org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status 
STOPPING to STOPPED
I0513 22:10:20.492 THREAD29 
com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute: 
Shutting down embedded http server
I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped 
[email protected]:8081
I0513 22:10:20.594 THREAD29 
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle 
state machine transition DEAD -> DEAD
I0513 22:10:20.594 THREAD29 
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already 
invoked, ignoring extra call.
I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run: 
Application run() exited.

{code}
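
To make the shutdown sequence in the log above easier to follow, here is a 
minimal Python sketch that rejoins the hard-wrapped lines and lists the state 
machine transitions. The function names (join_records, transitions) and the 
regular expressions are assumptions based only on the line format visible in 
this snippet; they are not part of Aurora or Mesos.

```python
import re

# Assumed record prefix, as seen in the snippet: severity letter, MMDD, time.
RECORD_START = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+ ")
# Matches e.g. "SchedulerLifecycle state machine transition ACTIVE -> DEAD".
TRANSITION = re.compile(r"(\w+) state machine transition (\w+) -> (\w+)")


def join_records(lines):
    """Merge hard-wrapped continuation lines back into single log records."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if RECORD_START.match(text) or not records:
            records.append(text)  # a new record begins here
        else:
            records[-1] = records[-1] + " " + text  # continuation of the previous record
    return records


def transitions(lines):
    """Return (time, machine, old_state, new_state) for each transition record."""
    found = []
    for record in join_records(lines):
        match = TRANSITION.search(record)
        if match:
            time_field = record.split()[1]  # second field is the time of day
            found.append((time_field,) + match.groups())
    return found
```

Calling transitions(open(path)) on a saved copy of the log (path is a 
placeholder) would list each transition in order, which should show where the 
scheduler moved from ACTIVE to DEAD. Note that the time-of-day values are 
returned as logged and are not normalized across log sources.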



--
This message was sent by Atlassian JIRA
(v6.2#6252)