[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1840:
--------------------------------
    Fix Version/s: 0.17.0

> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Assignee: David McLaughlin
>            Priority: Blocker
>             Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side effect of these is occasional stop-the-world GC 
> pauses of up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as short as 7s 
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK 
> timeout, so our session timeout is being wired through fine. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 
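
The timing arithmetic in the description can be checked with a short sketch. It assumes the 30s session timeout stated above and the 2/3 ratio the reporter cites for the point at which the ZooKeeper client gives up on a silent server. The helper names (read_timeout_ms, gap_ms) are hypothetical and are not part of Aurora, Curator, or ZooKeeper; the timestamps are copied from the quoted log.

```python
from datetime import datetime


def read_timeout_ms(session_timeout_ms: int) -> int:
    """Client-side silence threshold: 2/3 of the session timeout (per the report)."""
    return session_timeout_ms * 2 // 3


def gap_ms(earlier: str, later: str) -> int:
    """Milliseconds between two same-day HH:MM:SS.mmm log timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return round(delta.total_seconds() * 1000)


# Last scheduler log line before the pause, and the ZK timeout warning after it.
pause_ms = gap_ms("19:40:16.903", "19:40:31.744")

# Assumption taken from the description: a 30s ZK session timeout.
threshold_ms = read_timeout_ms(30_000)

# Silence reported by the client in the warning line.
reported_silence_ms = 20743

print(pause_ms, threshold_ms, reported_silence_ms > threshold_ms)
```

The gap between the two log lines is about 14.8s, which matches the 15s pause. The silence the client reports (20743ms) exceeds the 20000ms threshold, which suggests the client had already gone roughly 6s without hearing from the server when the pause began.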



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
