[ https://issues.apache.org/jira/browse/AURORA-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008049#comment-14008049 ]

Nathan Howell commented on AURORA-470:
--------------------------------------

There is nothing else related to the job or task in the log, and no entries 
around 20:53:47. This is on the newly elected leader, which became leader 
only recently, at 20:51:30.002.

It looks like the trigger for the leader election is ZooKeeper timing out a 
session. If the task state is updated through the replicated log, though, I 
wouldn't expect that to be related...
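
For context, an 8300ms gap against a ~3-4s session timeout (both visible in 
the log below) guarantees the server expires the session, so the reconnect can 
only come back with "session ... has expired". The client-side handling is 
roughly this shape (a minimal sketch, not the twitter-commons ZooKeeperClient 
implementation; shutdown() stands in for Lifecycle.shutdown()):
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;

// Minimal sketch of the client-side expiration path. Once ZooKeeper reports
// Expired, the ephemeral leader node is already gone, so the only safe move
// is to shut down and let a standby win the next election.
final class SessionWatcher implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == KeeperState.Expired) {
      shutdown();
    }
  }

  private void shutdown() {
    // Release leadership-scoped resources and exit so a replica can take over.
    System.exit(1);
  }
}
// Usage (matching the timeout seen in the log):
//   new ZooKeeper("zook01.prod:2181", 3000, new SessionWatcher());
{code}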

This is from the next failure (about an hour later); it is happening 
semi-frequently right now even though the ZK nodes are mostly idle, maybe 
around 5-10 IOPS. Here are the relevant log sections:

.err (timestamps are from the file and look like a mix of UTC and PDT):
{code}
I0523 21:56:34.022322 21919 leveldb.cpp:337] Persisting action (27 bytes) to leveldb took 271691ns
I0523 21:56:34.022356 21919 replica.cpp:671] Persisted action at 1402962
I0523 21:56:34.022377 21919 replica.cpp:656] Replica learned APPEND action at position 1402962
I0524 04:56:38.080 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run: Client session timed out, have not heard from server in 8300ms for sessionid 0x1461589250bea42, closing socket connection and attempting reconnect
I0523 21:56:38.080940 21911 log.cpp:849] Attempting to append 2352 bytes to the log
I0523 21:56:38.081123 21903 coordinator.cpp:335] Coordinator attempting to write APPEND action at position 1402962
I0523 21:56:38.081984 21916 replica.cpp:515] Replica received write request for position 1402962
I0523 21:56:38.082059 21916 leveldb.cpp:432] Reading position from leveldb took 23172ns
E0523 21:56:38.082765 21911 log.cpp:866] Failed to append the log: Coordinator demoted
I0524 04:56:38.382 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server ...zook01.prod...
I0524 04:56:38.383 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to ...zook01.prod..., initiating session
I0524 04:56:38.384 THREAD17 org.apache.zookeeper.ClientCnxn$SendThread.run: Unable to reconnect to ZooKeeper service, session 0x1461589250bea42 has expired, closing socket connection
I0524 04:56:38.385 THREAD18 org.apache.zookeeper.ZooKeeper.<init>: Initiating client connection, connectString=...zook01.prod....,...zook02.prod...,...zook03.prod... sessionTimeout=3000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$2@b2c67db
I0524 04:56:38.387 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.startConnect: Opening socket connection to server ...zook02.prod...
I0524 04:56:38.388 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.primeConnection: Socket connection established to ...zook02.prod..., initiating session
I0524 04:56:38.389 THREAD7720 org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult: Session establishment complete on server ...zook02..., sessionid = 0x24615892536f7ef, negotiated timeout = 4000
{code}

and the .log:
{code}
I0524 04:56:38.082 THREAD7718 com.twitter.common.application.Lifecycle.shutdown: Shutting down application
I0524 04:56:38.083 THREAD7718 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 8 shutdown commands.
I0524 04:56:38.087 THREAD7718 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: Thread-7692 (id 7718)
java.lang.Thread.getStackTrace(Thread.java:1588)
  org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:141)
  com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
  com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:354)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:374)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:318)
  org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:148)
...
{code}
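
The trace shows a fail-fast escalation: the append fails because the 
coordinator was demoted, the log stream disables itself, and that triggers a 
full Lifecycle.shutdown(). Schematically (hypothetical, simplified names; the 
real class is MesosLog$LogStream wrapping the native replicated log):
{code}
// Sketch of the fail-fast escalation visible in the trace above.
final class LogStream {
  private final Runnable shutdown; // stands in for Lifecycle::shutdown

  LogStream(Runnable shutdown) {
    this.shutdown = shutdown;
  }

  void append(byte[] entry) throws Exception {
    try {
      write(entry); // fails with "Coordinator demoted" after the ZK session is lost
    } catch (Exception e) {
      // A demoted coordinator cannot write again without re-election, so the
      // only consistent option is to stop this scheduler and let the next
      // leader drive the log.
      disableLog();
      throw e;
    }
  }

  private void disableLog() {
    shutdown.run(); // -> Lifecycle.shutdown(), as in the stack trace
  }

  private void write(byte[] entry) throws Exception {
    throw new Exception("Coordinator demoted"); // placeholder for the native write
  }
}
{code}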

> Tasks get stuck in THROTTLED state on restart or leader change
> --------------------------------------------------------------
>
>                 Key: AURORA-470
>                 URL: https://issues.apache.org/jira/browse/AURORA-470
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>    Affects Versions: 0.5.0
>            Reporter: Nathan Howell
>
> We're seeing cases where tasks get stuck in the THROTTLED state indefinitely. 
> From what I can tell from the logs, this happens if a task is throttled when 
> Aurora is shut down or a new leader is elected.
> It looks like the timer that changes the state from THROTTLED to PENDING is 
> only set up on a transition to the THROTTLED state... it seems like there is 
> no way to get these tasks running again except to restart them manually.
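
One way to close the gap the description points at would be a recovery pass on 
startup/failover that re-arms (or immediately fires) the THROTTLED -> PENDING 
transition. A sketch only, with TaskStore and the string states as hypothetical 
stand-ins for Aurora's actual storage and state machine:
{code}
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of a recovery pass for tasks stranded in THROTTLED after a restart
// or leader change. All names are hypothetical stand-ins; Aurora's real
// storage, state machine, and backoff calculation differ.
final class ThrottledTaskRecovery {
  interface TaskStore {
    List<String> fetchTaskIds(String state);          // tasks currently in a state
    void changeState(String taskId, String toState);  // apply a transition
  }

  private final TaskStore store;
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  ThrottledTaskRecovery(TaskStore store) {
    this.store = store;
  }

  // Run once after the scheduler becomes leader and storage is recovered.
  // Without a pass like this, THROTTLED tasks have no pending timer and wait
  // forever, since the timer is only armed on the THROTTLED transition itself.
  void recover(long backoffMs) {
    for (String taskId : store.fetchTaskIds("THROTTLED")) {
      timer.schedule(
          () -> store.changeState(taskId, "PENDING"),
          backoffMs,
          TimeUnit.MILLISECONDS);
    }
  }
}
{code}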



--
This message was sent by Atlassian JIRA
(v6.2#6252)
