[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David McLaughlin updated AURORA-1840:
-------------------------------------
    Description: 
We've been having some performance issues recently with our production clusters 
at Twitter. A side-effect of these are occassional stop-the-world GC pauses for 
up to 15 seconds. This has been happening at our scale for quite some time, but 
previous versions of the Scheduler were resilient to this and no leadership 
change would occur. 

Since we moved to Curator, we are no longer resilient to these GC pauses. The 
Scheduler is now failing over any time we see a GC pause, even though these 
pauses are within the session timeout. Here is an example pause in the 
scheduler logs with the associated ZK session timeout that leads to a failover:

{code}
I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
586236ns
I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
-> ASSIGNED 
I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
W1118 19:40:31.744 [main-SendThread(redacted:2181), ClientCnxn$SendThread:1108] 
Client session timed out, have not heard from server in 20743ms for sessionid 
0x6584fd2b34ede86 
{code}

As you can see from the timestamps, there was a 15s GC pause (confirmed in our 
GC logs - a CMS promotion failure caused the pause) and this triggers a session 
timeout of 20s to fire. Note: we have seen GC pauses as little as 7s cause the 
same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK timeout, so 
our session timeout is being wired through fine. 

We have confirmed that the Scheduler no longer fails over when deploying from 
HEAD with these two commits reverted and setting zk_use_curator to false:

https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5

This is a pretty big blocker for us given how expensive Scheduler failovers are 
(currently several minutes for us). 

  was:
We've been having some performance issues recently with our production clusters 
at Twitter. A side-effect of these are occassional stop-the-world GC pauses for 
up to 15 seconds. This has been happening at our scale for quite some time, but 
previous versions of the Scheduler were resilient to this and no leadership 
change would occur. 

Since we moved to Curator, we are no longer resilient to these GC pauses. The 
Scheduler is now failing over any time we see a GC pause, even though these 
pauses are within the session timeout. Here is an example pause in the 
scheduler logs with the associated ZK session timeout that leads to a failover:

{code}
I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
586236ns
I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
-> ASSIGNED 
I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
W1118 19:40:31.744 [main-SendThread(redacted:2181), ClientCnxn$SendThread:1108] 
Client session timed out, have not heard from server in 20743ms for sessionid 
0x6584fd2b34ede86 
{code}

As you can see from the timestamps, there was a 15s GC pause (confirmed in our 
GC logs - a CMS promotion failure caused the pause) and this somehow triggers a 
session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s 
cause the same behavior. 

That's problem one. The second problem is that we are using the 
'zk_session_timeout' flag and have the value set to 30s. This does not seem to 
be passed to the Curator client as it is getting a value of 20s from somewhere. 

We have confirmed that the Scheduler no longer fails over when deploying from 
HEAD with these two commits reverted and setting zk_use_curator to false:

https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5

This is a pretty big blocker for us given how expensive Scheduler failovers are 
(currently several minutes for us). 


> Issue with Curator-backed discovery under heavy load
> ----------------------------------------------------
>
>                 Key: AURORA-1840
>                 URL: https://issues.apache.org/jira/browse/AURORA-1840
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Priority: Blocker
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side-effect of these are occassional stop-the-world GC 
> pauses for up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s 
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK 
> timeout, so our session timeout is being wired through fine. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to