[Puppet - Bug #3362] splay drift occurs when passenger/mongrel get too much load.

tickets Mon, 20 Aug 2012 09:35:56 -0700

Issue #3362 has been updated by Simon Mügge.

Nigel Kersten wrote:
> Have we been able to confirm Dan's original assumptions?
> 
> I'm not seeing widespread clamor over this issue other than Mark's request to 
> not change existing behavior :)

I "just" ran into this issue as well, while trying to figure out why the load 
on my masters peaks so badly. It had become as predictable as "Is it ten 
past? Works as intended then, come back in 5..".

I am trying to serve ~4500 clients via two 16-core, 24 GB RAM machines, which 
should (!?) be enough to handle this, but.. 
2 runs per hour, splaylimit unset (so same as runinterval) at the moment, and I 
get 6 peaks per hour with next to no connections in between.
Changed this (just to change something, really) from a splaylimit of 20 minutes 
to the current 30. The first couple of hours seemed a tiny bit better, but now, 
after 4 days, it seems as if I've just reordered my thundering herd.. :/
And you can watch the peaks shifting slowly (give it 2-4 weeks from now) 
towards one another, because not all peaks carry the same load, which would 
support the original assumption.
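
For scale, here is a rough back-of-the-envelope calculation that uses only the 
figures quoted above (~4500 clients, 2 runs per hour, 6 peaks per hour). The 
function name is made up for illustration, and the per-peak number assumes the 
peaks carry equal load, which the message says is not quite the case.

```python
def checkin_load(clients, runs_per_hour, peaks_per_hour):
    """Return (evenly splayed check-ins per second, check-ins per peak)."""
    total_runs_per_hour = clients * runs_per_hour
    # Ideal case: splay spreads all runs evenly over the hour.
    even_rate_per_second = total_runs_per_hour / 3600
    # Observed case (simplified): all runs land in a few equal peaks.
    runs_per_peak = total_runs_per_hour / peaks_per_hour
    return even_rate_per_second, runs_per_peak


rate, per_peak = checkin_load(clients=4500, runs_per_hour=2, peaks_per_hour=6)
print(f"evenly splayed: {rate:.2f} check-ins/s across both masters")
print(f"all load in the peaks: {per_peak:.0f} check-ins per peak")
```

Evenly splayed, this fleet would produce only a few check-ins per second in 
total; packed into six peaks, each peak has to absorb a sixth of the hourly 
load in one burst.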

Since I just found this patch, and things like these take a while to implement 
around here, it might be quite a while until I can give any feedback on the 
patch, I'm sorry to say.
----------------------------------------
Bug #3362: splay drift occurs when passenger/mongrel get too much load.
https://projects.puppetlabs.com/issues/3362#change-69415

Author: Dan Bode
Status: Needs More Information
Priority: Low
Assignee: Nigel Kersten
Category: plumbing
Target version: 2.7.x
Affected Puppet version: 0.25.4
Keywords: passenger load splay mongrel connection timeouts
Branch: http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362

Not sure if this counts as a bug...

I could not concretely prove the assumptions below. I did some investigation 
and this is my best guess as to the cause.

Splay was drifting for hundreds of machines so that over time, most were 
checking in at the same time, while at other times none were checking in. Here 
is my theory as to why.

Splay is only applied once, to the first run after puppet starts.

Assumption: runinterval starts counting only after the client finishes its last 
run?
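
To make the consequence of this assumption concrete, here is a minimal sketch. 
It is not Puppet code: `checkin_times` and the `count_from_finish` flag are 
hypothetical names, and the run durations are invented numbers. With 
`count_from_finish=True` it models the assumption above (the next run starts 
runinterval after the previous run finished); `False` is only a comparison 
case in which the interval is counted from the start of the previous run.

```python
def checkin_times(first_start, run_durations, runinterval, count_from_finish=True):
    """Return the start times (in seconds) of successive agent runs.

    count_from_finish=True  -> next start = previous finish + runinterval
                               (the assumption made in this ticket)
    count_from_finish=False -> next start = previous start + runinterval
                               (hypothetical comparison case)
    """
    starts = [first_start]
    for duration in run_durations:
        base = starts[-1] + duration if count_from_finish else starts[-1]
        starts.append(base + runinterval)
    return starts


# 30-minute runinterval; runs take 1 minute, except one 10-minute run
# while the master is overloaded (invented numbers).
durations = [60, 600, 60]
drifting = checkin_times(0, durations, 1800)
fixed = checkin_times(0, durations, 1800, count_from_finish=False)

print("counted from finish:", drifting)
print("counted from start: ", fixed)
```

If the assumption holds, every slow run pushes all later check-ins of that 
client back by the same amount, and nothing ever pulls them forward again. 
The initial splay offset is therefore only a starting point, not something the 
client returns to.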

Here is the chain of events that I think causes this:

1. Passenger or mongrel is under heavy load.
2. Worker processes get used up, so the master starts queuing hosts.
3. Once a machine falls into the queue, it gets stuck with the group of 
machines that caused the queue to fill up, since it will now use runinterval and 
check in at the same time as the other machines that were running at that same 
time.
4. Over time, splay drifts so that most machines are checking in at the same 
time.

Basically, once performance starts getting bad, the splaying falls apart so 
that it gets much worse.
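
The chain of events above can be sketched as a toy simulation. This is an 
illustration under strong simplifying assumptions, not a model of passenger or 
mongrel: a single worker, strict first-come-first-served queuing, a fixed run 
time, no timeouts, and runinterval counted from the end of each run (the 
unproven assumption above). `simulate_master` and all numbers are made up.

```python
import heapq


def simulate_master(offsets, service_time, runinterval, rounds):
    """Toy single-worker FIFO master.

    offsets: initial (splayed) request time of each client, in seconds.
    Returns one list per client of (requested_at, started_at) tuples.
    Assumes each client's next request = its last finish + runinterval.
    """
    pending = [(offset, client) for client, offset in enumerate(offsets)]
    heapq.heapify(pending)  # earliest request is served first
    history = [[] for _ in offsets]
    worker_free_at = 0
    while min(len(runs) for runs in history) < rounds:
        requested, client = heapq.heappop(pending)
        started = max(requested, worker_free_at)  # wait if the worker is busy
        finished = started + service_time
        worker_free_at = finished
        history[client].append((requested, started))
        heapq.heappush(pending, (finished + runinterval, client))
    return [runs[:rounds] for runs in history]


# Four clients bunched at the start, one client well splayed at 15 minutes.
history = simulate_master([0, 60, 120, 180, 900],
                          service_time=300, runinterval=1800, rounds=3)
for client, runs in enumerate(history):
    print(client, runs)
```

In this run the last client starts out 12 minutes behind its nearest neighbour, 
gets caught behind the queue once, and from then on requests its catalog at the 
exact moment the previous client finishes. It never gets its original offset 
back. The sketch does not show how such groups grow over weeks; that would need 
varying run times, which are left out here.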
