David McLaughlin commented on AURORA-1840:

I figured this out. The old zk commons approach was a session-based 
leadership concept: leadership was held as long as the ZK session stayed 
alive, so a connection blip within the session timeout did not trigger a 
failover.

The new Curator-based recipe is built on Curator's LeaderLatch, which is 
explicitly a connection-based leadership concept:

From http://curator.apache.org/curator-recipes/leader-latch.html

{quote}LeaderLatch instances add a *ConnectionStateListener* to watch for 
connection problems. If SUSPENDED or LOST is reported, the LeaderLatch that is 
the leader will report that it is no longer the leader (i.e. *there will not be 
a leader until the connection is re-established*). If a LOST connection is 
RECONNECTED, the LeaderLatch will delete its previous ZNode and create a new 
one.{quote}
This is a really terrible idea for a Scheduler that can take minutes to fail 
over. We'll need to revert the two commits listed in the issue description 
until someone can come up with a better Curator-backed recipe.
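To make the distinction concrete, here is a minimal, self-contained sketch (hypothetical code, not Curator's actual implementation) contrasting how a session-based elector and a connection-based elector such as LeaderLatch react to a SUSPENDED connection event. The class and method names are illustrative only:

```java
// Hypothetical simulation: a connection-based recipe (LeaderLatch-style)
// abdicates on SUSPENDED, while a session-based recipe only gives up
// leadership when the session is truly gone (LOST).

enum ConnectionState { CONNECTED, SUSPENDED, LOST }

class Elector {
    private final boolean abdicateOnSuspend;
    private boolean leader = true; // assume we currently hold leadership

    Elector(boolean abdicateOnSuspend) {
        this.abdicateOnSuspend = abdicateOnSuspend;
    }

    void onStateChange(ConnectionState state) {
        if (state == ConnectionState.LOST
                || (abdicateOnSuspend && state == ConnectionState.SUSPENDED)) {
            leader = false;
        }
    }

    boolean isLeader() { return leader; }
}

public class LeadershipDemo {
    public static void main(String[] args) {
        Elector sessionBased = new Elector(false);   // old zk commons style
        Elector connectionBased = new Elector(true); // LeaderLatch style

        // A long GC pause surfaces as SUSPENDED, within the session timeout.
        sessionBased.onStateChange(ConnectionState.SUSPENDED);
        connectionBased.onStateChange(ConnectionState.SUSPENDED);

        System.out.println("session-based still leader:    " + sessionBased.isLeader());
        System.out.println("connection-based still leader: " + connectionBased.isLeader());
    }
}
```

Under this model a GC pause that only causes SUSPENDED costs the connection-based elector its leadership, which matches the failover behavior described in the issue.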

> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Priority: Blocker
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side effect of these is occasional stop-the-world GC 
> pauses of up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to it and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause), and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as short as 7s 
> cause the same behavior. (Earlier note removed: my ZK was rusty.) The 20s 
> client timeout is 2/3 of our 30s ZK session timeout, so our session timeout 
> is being wired through correctly. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 
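The timeout arithmetic in the quoted report can be sketched as follows. This is an illustration of my understanding of the ZooKeeper client's behavior (it declares "Client session timed out, have not heard from server" after roughly 2/3 of the negotiated session timeout, reserving the rest for reconnect attempts); the class and method names are hypothetical:

```java
// Sketch (assumed behavior): the ZooKeeper client's read timeout is
// 2/3 of the session timeout, so a 30s session yields the ~20s
// "have not heard from server" warning seen in the logs above.

public class SessionTimeoutMath {
    static long readTimeoutMs(long sessionTimeoutMs) {
        // 2/3 of the session timeout before the connection is declared suspect
        return sessionTimeoutMs * 2 / 3;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 30_000; // the 30s ZK session timeout above
        System.out.println("read timeout: " + readTimeoutMs(sessionTimeoutMs) + "ms");
        // A 15s GC pause is under 20s, so the session itself survives;
        // only a connection-based recipe treats the pause as a leadership loss.
    }
}
```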

This message was sent by Atlassian JIRA
