Issue #3362 has been updated by Simon Mügge.
Nigel Kersten wrote:
> Have we been able to confirm Dan's original assumptions?
>
> I'm not seeing widespread clamor over this issue other than Mark's request to
> not change existing behavior :)

I "just" ran into this issue as well, after trying to figure out why the load on my masters peaks so badly. It had become as predictable as "Is it ten past? Works as intended then, come back in 5...".

I am trying to serve ~4500 clients from two 16-core, 24 GB RAM machines, which should (!?) be enough to handle this, but... With 2 runs per hour and splaylimit unset (so the same as runinterval) at the moment, I get 6 peaks per hour with next to no connections in between. I changed this (just to change something, really) from a splaylimit of 20 minutes to the current 30. The first couple of hours seemed a tiny bit better, but now, after 4 days, it seems as if I've just reordered my thundering herd... :/

And you can watch the peaks slowly shifting towards one another (give it 2-4 weeks from now), because not all peaks carry the same load, which would support the original assumption.

Since I just found this patch, and things like these take a while to implement around here, it might be quite a while until I can give any feedback on the patch, I'm sorry to say.

----------------------------------------
Bug #3362: splay drift occurs when passenger/mongrel get too much load.
https://projects.puppetlabs.com/issues/3362#change-69415

Author: Dan Bode
Status: Needs More Information
Priority: Low
Assignee: Nigel Kersten
Category: plumbing
Target version: 2.7.x
Affected Puppet version: 0.25.4
Keywords: passenger load splay mongrel connection timeouts
Branch: http://github.com/MarkusQ/puppet/tree/ticket/0.25.x/3362

Not sure if this counts as a bug... I could not concretely prove the assumptions below. I did some investigation, and this is my best guess as to the cause.
Splay was drifting for hundreds of machines so that, over time, most were checking in at the same time, while at other times none were checking in. Here is my theory as to why.

Splay only runs the first time after puppet starts.

Assumption: runinterval starts counting only after the client finishes its last run?

Here is the chain of events that I think causes this:

1. passenger or mongrel is under heavy load.
2. Processes get used up, and they start queuing hosts.
3. Once a machine falls into the queue, it gets stuck with the group of machines that caused the queue to fill up, since it will now use runinterval and check in at the same time as the other machines that were running at that same time.
4. Over time, splay drifts so that most machines are checking in at the same time.

Basically, once performance starts getting bad, the splaying falls apart, which makes the load problem much worse.
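
To make the assumption in the bug description concrete, here is a toy Python sketch. It is not Puppet's agent code, and it does not prove the theory. The function names, the 300-second splay offset, and the run durations are all hypothetical, chosen only for illustration. It compares two scheduling policies: one where the next run starts runinterval seconds after the previous run finishes (the behavior the report assumes), and one where runs stay pinned to a fixed grid from the first start.

```python
def finish_based_schedule(first_start, runinterval, run_durations):
    """Start times when each run begins runinterval seconds after the
    previous run FINISHES (the behavior assumed in the bug report)."""
    starts = []
    t = first_start
    for duration in run_durations:
        starts.append(t)
        # Any time spent queued or running is added to the cycle for good.
        t = t + duration + runinterval
    return starts


def fixed_grid_schedule(first_start, runinterval, run_durations):
    """Start times when runs are pinned to first_start + k * runinterval.
    Assumes every run finishes in less than one runinterval."""
    return [first_start + k * runinterval for k in range(len(run_durations))]


RUNINTERVAL = 1800                 # 30 minutes, i.e. 2 runs per hour
SPLAY_OFFSET = 300                 # hypothetical one-time splay delay at startup
DURATIONS = [30, 30, 400, 30]      # hypothetical: third run sat in the master's queue

finish_based = finish_based_schedule(SPLAY_OFFSET, RUNINTERVAL, DURATIONS)
fixed_grid = fixed_grid_schedule(SPLAY_OFFSET, RUNINTERVAL, DURATIONS)

# Phase = position of the check-in inside the 30-minute window.
print([s % RUNINTERVAL for s in finish_based])  # [300, 330, 360, 760]
print([s % RUNINTERVAL for s in fixed_grid])    # [300, 300, 300, 300]
```

Under the finish-based policy, the 400-second run shifts the client's slot permanently, and every client delayed by the same queue shifts with it. Under the fixed grid, the original splay offset is preserved no matter how long a run takes.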
