[
https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997915#comment-13997915
]
Bill Farner edited comment on AURORA-420 at 5/14/14 7:27 PM:
-------------------------------------------------------------
By default, the scheduler will automatically fail over after 24 hours \[1\] of
leading. This is due to a limitation in the replicated log, since we have no
means to trigger LevelDB compaction (see MESOS-184 for more details).
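For context, here is a minimal sketch of the mechanism (this is not the actual
SchedulerModule code; the class, method names, and constant below are
illustrative only): once a scheduler starts leading it arms a one-shot timer,
and when the timer fires it abdicates so a standby or restarted process takes
over with a freshly replayed log.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- not the real SchedulerModule/SchedulerLifecycle code.
final class LeadingFailoverSketch {
  // Hypothetical stand-in for the 24-hour leading limit referenced in [1].
  private static final long MAX_LEADING_HOURS = 24;

  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  void onLeading() {
    // Arm the automatic failover as soon as this instance wins leadership.
    timer.schedule(this::triggerFailover, MAX_LEADING_HOURS, TimeUnit.HOURS);
  }

  private void triggerFailover() {
    // In the real scheduler this drives the ACTIVE -> DEAD transition visible
    // in the attached log ("Triggering automatic failover."), and the process exits.
    System.exit(0);
  }
}
{code}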
{quote}
The last time we started the scheduler after a similar crash, all jobs were lost.
{quote}
This part is troubling, and something we have not seen. Can you provide more
details on your setup?
- Is there some sort of supervisor (e.g. monit, upstart) restarting the
scheduler on exit?
- How many schedulers are running in the cluster?
- Did you override the {{-native_log_quorum_size}} command line argument \[2\]?
If so, to what value? (See the quorum sketch just below this list.)
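As a quick reference, the quorum size should follow the majority rule from the
deployment docs \[2\]: floor(N/2) + 1 for N schedulers. The helper below just
illustrates that arithmetic; it is not an Aurora API.
{code}
// Sanity check for -native_log_quorum_size under the majority-quorum rule.
final class QuorumSizeCheck {
  static int expectedQuorumSize(int schedulerCount) {
    return schedulerCount / 2 + 1;  // 1 -> 1, 3 -> 2, 5 -> 3
  }

  public static void main(String[] args) {
    for (int n : new int[] {1, 3, 5}) {
      System.out.printf("%d scheduler(s) -> quorum size %d%n", n, expectedQuorumSize(n));
    }
  }
}
{code}
Setting the quorum larger than the number of live replicas keeps the log from
reaching consensus; setting it smaller than a majority risks conflicting writes.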
{quote}
We were running around 30 apps on different slaves during the crash. The apps
are still running on the slaves, though.
{quote}
When the scheduler restarted, did it appear to have a completely blank
database, or just stale? In the master UI, did it show up as a new framework?
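The reason the framework question matters: if the restarted scheduler still has
its previously stored framework ID, the master re-attaches it to the running
tasks; with blank storage it registers as a brand-new framework and the old
tasks are orphaned. A rough sketch of that distinction against the Mesos Java
API follows (the recoveredId handling is illustrative and is not how Aurora's
storage layer is actually wired up):
{code}
import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.FrameworkID;
import org.apache.mesos.Protos.FrameworkInfo;
import org.apache.mesos.Scheduler;

// Sketch only: shows why blank scheduler storage makes the master treat the
// scheduler as a new framework.
final class FrameworkRegistrationSketch {
  static MesosSchedulerDriver buildDriver(
      Scheduler scheduler, String master, String recoveredId /* null if storage was blank */) {
    FrameworkInfo.Builder info = FrameworkInfo.newBuilder()
        .setUser("")                        // let Mesos pick the current user
        .setName("ExampleFramework")
        .setFailoverTimeout(7 * 24 * 3600)  // seconds the master keeps tasks after a disconnect
        .setCheckpoint(true);               // slave-side checkpointing

    if (recoveredId != null) {
      // Re-registering with the stored ID re-attaches the existing tasks.
      info.setId(FrameworkID.newBuilder().setValue(recoveredId));
    }
    // With no ID set, the master assigns a fresh one and the framework shows up
    // as brand new in its UI -- the symptom asked about above.
    return new MesosSchedulerDriver(scheduler, info.build(), master);
  }
}
{code}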
\[1\]
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/SchedulerModule.java#L54
\[2\]
https://github.com/apache/incubator-aurora/blob/master/docs/deploying-aurora-scheduler.md
> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
> Key: AURORA-420
> URL: https://issues.apache.org/jira/browse/AURORA-420
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.6.0
> Reporter: Bhuvan Arumugam
>
> We are using the latest as of
> https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
> Technically it's 0.5.1-SNAPSHOT.
> The scheduler seems to crash due to corrupt data in the replica. It has crashed
> twice in the last 2 days. Here is the log snippet.
> The last time we started the scheduler after a similar crash, all jobs were lost.
> We were running around 30 apps on different slaves during the crash. The apps
> are still running on the slaves, though. The slaves are shown as running in the
> master UI. The scheduler seems to have trouble reconnecting to the running
> tasks when it comes back online. FWIW, we are not using checkpointing.
> Can you let me know:
> 1. how to prevent the crashes?
> 2. how to recover jobs from the replica backup?
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to
> the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request
> for position 29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to
> leveldb took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice
> for position 29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to
> leveldb took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at
> position 29779
> I0513 22:07:46.621 THREAD5299
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering automatic
> failover.
> I0513 22:10:20.475 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework
> '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine
> transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24
> com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
> server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29
> com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0513 22:10:20.487 THREAD31
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24
> org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No
> schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
> Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24
> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: All candidates
> have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
> Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
> Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
> com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
> com.twitter.common.base.Closures$4.execute(Closures.java:120)
> com.twitter.common.base.Closures$3.execute(Closures.java:98)
> com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29
> com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler
> shut down
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute: Stopping
> thrift server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.shutdown: Received shutdown
> request, stopping server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29
> com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute:
> Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped
> [email protected]:8081
> I0513 22:10:20.594 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run:
> Application run() exited.
> {code}