[ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977710#comment-15977710 ]
David McLaughlin commented on AURORA-1840: ------------------------------------------ Thanks John. Good to know we can customize it. For, I think that should be a blocker to making Curator the default and removing the twitter-commons implementation. > Issue with Curator-backed discovery under heavy load > ---------------------------------------------------- > > Key: AURORA-1840 > URL: https://issues.apache.org/jira/browse/AURORA-1840 > Project: Aurora > Issue Type: Bug > Components: Scheduler > Reporter: David McLaughlin > Assignee: David McLaughlin > Priority: Blocker > Fix For: 0.17.0 > > > We've been having some performance issues recently with our production > clusters at Twitter. A side-effect of these are occassional stop-the-world GC > pauses for up to 15 seconds. This has been happening at our scale for quite > some time, but previous versions of the Scheduler were resilient to this and > no leadership change would occur. > Since we moved to Curator, we are no longer resilient to these GC pauses. The > Scheduler is now failing over any time we see a GC pause, even though these > pauses are within the session timeout. Here is an example pause in the > scheduler logs with the associated ZK session timeout that leads to a > failover: > {code} > I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took > 586236ns > I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] > redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING > -> ASSIGNED > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work > command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] > Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is > being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. > W1118 19:40:31.744 [main-SendThread(redacted:2181), > ClientCnxn$SendThread:1108] Client session timed out, have not heard from > server in 20743ms for sessionid 0x6584fd2b34ede86 > {code} > As you can see from the timestamps, there was a 15s GC pause (confirmed in > our GC logs - a CMS promotion failure caused the pause) and this triggers a > session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s > cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK > timeout, so our session timeout is being wired through fine. > We have confirmed that the Scheduler no longer fails over when deploying from > HEAD with these two commits reverted and setting zk_use_curator to false: > https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f > https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5 > This is a pretty big blocker for us given how expensive Scheduler failovers > are (currently several minutes for us). -- This message was sent by Atlassian JIRA (v6.3.15#6346)