[
https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998976#comment-13998976
]
Bhuvan Arumugam commented on AURORA-420:
----------------------------------------
Looks like we should still use {{mesos-log}} for replication, going by the
{{vagrant/provision-dev-cluster.sh}} script. The problem is more likely that the
{{mesos-log}} command in v0.19.0 is unable to parse a replica created by an
older version, in this case v0.18.0.
> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
> Key: AURORA-420
> URL: https://issues.apache.org/jira/browse/AURORA-420
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.6.0
> Reporter: Bhuvan Arumugam
>
> We are using the latest as of
> https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe.
> Technically it's 0.5.1-snapshot.
> The scheduler seems to crash due to corrupt data in the replica. It has
> crashed twice in the last two days. Here is the log snippet.
> The last time we started the scheduler after a similar crash, all jobs were
> lost. We were running around 30 apps on different slaves during the crash.
> The apps are still running on the slaves, and the slaves are shown as
> running in the master UI. The scheduler seems to have trouble reconnecting
> to the running tasks when it comes back online. FWIW, we are not using
> checkpointing.
> Can you let me know:
> 1. how to prevent the crashes?
> 2. how to recover jobs from the replica backup?
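> For question 2, one possible path (a sketch, assuming your build includes
> the scheduler's backup/recovery admin commands and that periodic backups
> were enabled via {{-backup_dir}}; cluster name and backup id below are
> placeholders):
> {code}
> # List available on-disk backups on the leading scheduler (placeholder ids).
> aurora_admin scheduler_list_backups CLUSTER
>
> # Stage a chosen backup for recovery, inspect it, then commit it.
> aurora_admin scheduler_stage_recovery CLUSTER scheduler-backup-2014-05-13
> aurora_admin scheduler_commit_recovery CLUSTER
> {code}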
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to
> the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to
> write APPEND action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request
> for position 29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to
> leveldb took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice
> for position 29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to
> leveldb took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at
> position 29779
> I0513 22:07:46.621 THREAD5299
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301
> org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer:
> Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering automatic
> failover.
> I0513 22:10:20.475 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework
> '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine
> transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24
> com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
> server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29
> com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0513 22:10:20.487 THREAD31
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24
> org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No
> schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
> Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24
> com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: All candidates
> have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated:
> Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
> Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>
> org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>
> com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
> com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
> com.twitter.common.base.Closures$4.execute(Closures.java:120)
> com.twitter.common.base.Closures$3.execute(Closures.java:98)
> com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>
> org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29
> com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler
> shut down
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute: Stopping
> thrift server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.shutdown: Received shutdown
> request, stopping server.
> I0513 22:10:20.491 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29
> org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status
> STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29
> com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute:
> Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped
> [email protected]:8081
> I0513 22:10:20.594 THREAD29
> com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
> state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29
> org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already
> invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run:
> Application run() exited.
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)