[
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977689#comment-15977689
]
John Sirois commented on AURORA-1840:
-------------------------------------
[~davmclau] - I think the Curator patch I linked above
(https://issues.apache.org/jira/browse/CURATOR-248) addresses this for the 3.x
series - you can install a custom error handler that behaves in the way Aurora
desires.
> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
> Key: AURORA-1840
> URL: https://issues.apache.org/jira/browse/AURORA-1840
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Reporter: David McLaughlin
> Assignee: David McLaughlin
> Priority: Blocker
> Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production
> clusters at Twitter. A side-effect of these are occassional stop-the-world GC
> pauses for up to 15 seconds. This has been happening at our scale for quite
> some time, but previous versions of the Scheduler were resilient to this and
> no leadership change would occur.
> Since we moved to Curator, we are no longer resilient to these GC pauses. The
> Scheduler is now failing over any time we see a GC pause, even though these
> pauses are within the session timeout. Here is an example pause in the
> scheduler logs with the associated ZK session timeout that leads to a
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389]
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING
> -> ASSIGNED
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130]
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8.
> W1118 19:40:31.744 [main-SendThread(redacted:2181),
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from
> server in 20743ms for sessionid 0x6584fd2b34ede86
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in
> our GC logs - a CMS promotion failure caused the pause) and this triggers a
> session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK
> timeout, so our session timeout is being wired through fine.
> We have confirmed that the Scheduler no longer fails over when deploying from
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers
> are (currently several minutes for us).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)