[ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712221#comment-15712221 ]

Stephan Erb commented on AURORA-1840:
-------------------------------------

Reverting as a tactical solution until our Curator implementation has been 
improved sounds good to me.

Regarding Curator: the leader election recipe at 
http://curator.apache.org/curator-recipes/leader-election.html seems more 
suitable for our use case. It is callback-based and offers more control:
{quote}
The LeaderSelectorListener class extends ConnectionStateListener. When the 
LeaderSelector is started, it adds the listener to the Curator instance. Users 
of the LeaderSelector must pay attention to any connection state changes. If an 
instance becomes the leader, it should respond to notification of being 
SUSPENDED or LOST. If the SUSPENDED state is reported, the instance must assume 
that it might no longer be the leader until it receives a RECONNECTED state. If 
the LOST state is reported, the instance is no longer the leader and its 
takeLeadership method should exit.

IMPORTANT: The recommended action for receiving SUSPENDED or LOST is to throw 
CancelLeadershipException. This will cause the LeaderSelector instance to 
attempt to interrupt and cancel the thread that is executing the takeLeadership 
method. Because this is so important, you should consider extending 
LeaderSelectorListenerAdapter. LeaderSelectorListenerAdapter has the 
recommended handling already written for you.
{quote}
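
For illustration, a minimal sketch of that callback style using the LeaderSelector recipe (the class name, election path, and leadership body here are illustrative, not Aurora's actual wiring):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

// Hypothetical sketch: extending LeaderSelectorListenerAdapter means that
// SUSPENDED/LOST connection states throw CancelLeadershipException, which
// interrupts the thread currently running takeLeadership().
public class SchedulerLeaderListener extends LeaderSelectorListenerAdapter {
  private final LeaderSelector selector;

  public SchedulerLeaderListener(CuratorFramework client, String electionPath) {
    this.selector = new LeaderSelector(client, electionPath, this);
    // Re-enter the election whenever leadership is relinquished.
    this.selector.autoRequeue();
  }

  public void start() {
    selector.start();
  }

  @Override
  public void takeLeadership(CuratorFramework client) throws Exception {
    // We hold leadership for as long as this method does not return.
    // On SUSPENDED/LOST the adapter cancels leadership and this thread is
    // interrupted, so we simply block until that happens.
    try {
      Thread.currentThread().join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}

The relevant point for this bug is that connection-state changes surface explicitly in the listener, so we could decide how a SUSPENDED state during a long GC pause should be handled rather than failing over unconditionally.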

> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Assignee: David McLaughlin
>            Priority: Blocker
>             Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side effect of these is occasional stop-the-world GC 
> pauses of up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to it and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause), and this caused the 
> 20s session timeout above to fire. Note: we have seen GC pauses as short as 
> 7s cause the same behavior. (Earlier speculation removed - my ZK was rusty.) 
> The 20s is 2/3 of our 30s ZK timeout, so our session timeout is being wired 
> through correctly. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and zk_use_curator set to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 
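
For reference, the 20s in the ClientCnxn warning above follows directly from the 30s session timeout: the ZooKeeper client treats the connection as timed out once it has not heard from the server for 2/3 of the negotiated session timeout. A rough sketch of the arithmetic (the 30s value is taken from the description above):

{code}
public class SessionTimeoutMath {
  public static void main(String[] args) {
    // Negotiated ZooKeeper session timeout, per the description above.
    int sessionTimeoutMs = 30_000;

    // The ZooKeeper client (ClientCnxn) considers the connection timed out
    // once it has not heard from the server for 2/3 of the session timeout.
    int readTimeoutMs = sessionTimeoutMs * 2 / 3;

    // Prints 20000 -- matching the "have not heard from server in 20743ms"
    // warning above, which the GC pause plus normal idle time exceeded.
    System.out.println(readTimeoutMs);
  }
}
{code}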



