[ https://issues.apache.org/jira/browse/AURORA-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008049#comment-14008049 ]
Nathan Howell commented on AURORA-470:
--------------------------------------
There is nothing else related to the job or task in the log, and no entries
around 20:53:47. This is on the newly elected leader, which became leader
only recently, at 20:51:30.002.
It looks like the trigger for the leader election is ZooKeeper timing out a
session. If task state is updated through the replicated log, though, I
wouldn't think that's related...
This is from the next failure (about an hour later), but it is happening
semi-frequently right now even though the ZK nodes are mostly idle, maybe
around 5-10 IOPS. Here are the relevant log sections.
From the .err file (timestamps are as written in the file; they look like a mix of UTC and PDT):
{code}
I0523 21:56:34.022322 21919 leveldb.cpp:337] Persisting action (27 bytes) to
leveldb took 271691ns
I0523 21:56:34.022356 21919 replica.cpp:671] Persisted action at 1402962
I0523 21:56:34.022377 21919 replica.cpp:656] Replica learned APPEND action at
position 1402962
I0524 04:56:38.080 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run:
Client session timed out, have not heard from server in 8300ms for sessionid
0x1461589250bea42, closing socket connection and attempting reconnect
I0523 21:56:38.080940 21911 log.cpp:849] Attempting to append 2352 bytes to the
log
I0523 21:56:38.081123 21903 coordinator.cpp:335] Coordinator attempting to
write APPEND action at position 1402962
I0523 21:56:38.081984 21916 replica.cpp:515] Replica received write request for
position 1402962
I0523 21:56:38.082059 21916 leveldb.cpp:432] Reading position from leveldb took
23172ns
E0523 21:56:38.082765 21911 log.cpp:866] Failed to append the log: Coordinator
demoted
I0524 04:56:38.382 THREAD17
org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket
connection to server ...zook01.prod...
I0524 04:56:38.383 THREAD17
org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection
established to ...zook01.prod..., initiating session
I0524 04:56:38.384 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run:
Unable to reconnect to ZooKeeper service, session 0x1461589250bea42 has
expired, closing socket connection
I0524 04:56:38.385 THREAD18 org.apache.zookeeper.ZooKeeper.<init>: Initiating
client connection,
connectString=...zook01.prod....,...zook02.prod...,...zook03.prod...
sessionTimeout=3000
watcher=com.twitter.common.zookeeper.ZooKeeperClient$2@b2c67db
I0524 04:56:38.387 THREAD7720
org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket
connection to server ...zook02.prod...
I0524 04:56:38.388 THREAD7720
org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection
established to ...zook02.prod..., initiating session
I0524 04:56:38.389 THREAD7720
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session
establishment complete on server ...zook02..., sessionid = 0x24615892536f7ef,
negotiated timeout = 4000
{code}
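As an aside on the "negotiated timeout = 4000" line above: the ZooKeeper server clamps the client's requested sessionTimeout into [minSessionTimeout, maxSessionTimeout], which default to 2x and 20x the server's tickTime. A minimal sketch of that clamping (the 2000ms tickTime is an assumption, matching ZooKeeper's default, under which the 3000ms request in the log would come back as 4000ms):

```java
// Sketch of ZooKeeper's session-timeout negotiation: the server clamps the
// client's requested timeout into [minSessionTimeout, maxSessionTimeout],
// which default to 2*tickTime and 20*tickTime respectively.
public class SessionTimeoutClamp {
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // default minSessionTimeout
        int max = 20 * tickTimeMs;  // default maxSessionTimeout
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        // Assuming the default tickTime of 2000ms, the 3000ms request seen in
        // the log is raised to the 4000ms floor.
        System.out.println(negotiate(3000, 2000));
    }
}
```

This would also explain why the client never gets the 3000ms session it asks for on these servers.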
and from the .log file:
{code}
I0524 04:56:38.082 THREAD7718
com.twitter.common.application.Lifecycle.shutdown: Shutting down application
I0524 04:56:38.083 THREAD7718
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 8 shutdown commands.
I0524 04:56:38.087 THREAD7718
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: Thread-7692 (id 7718)
java.lang.Thread.getStackTrace(Thread.java:1588)
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:141)
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:354)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:374)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:318)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:148)
...
{code}
> Tasks get stuck in THROTTLED state on restart or leader change
> --------------------------------------------------------------
>
> Key: AURORA-470
> URL: https://issues.apache.org/jira/browse/AURORA-470
> Project: Aurora
> Issue Type: Story
> Components: Scheduler
> Affects Versions: 0.5.0
> Reporter: Nathan Howell
>
> We're seeing cases where tasks get stuck in the THROTTLED state indefinitely.
> From what I can tell from the logs, this happens if a task is throttled when
> Aurora is shut down or a new leader is elected.
> It looks like the timer that changes the state from THROTTLED to PENDING is
> only set up on a transition to the THROTTLED state... it seems there is no
> way to get these tasks running again except to restart them manually.
--
This message was sent by Atlassian JIRA
(v6.2#6252)