Issue #11381 has been updated by Eric Shamow.
Hi Jo,

Jo Rhett wrote:

> 20 processes that are 104mb in size don't consume 8GB of memory. The exact
> size of the ruby processes is a known factor here. Yes, they might be bigger
> than they would in 1.8.7, but that's not the nature of the problem. The
> question is where all these spare processes are coming from.
>
> Imagine that they are half the size (I doubt it) in 1.8.7 -- if ruby spawns
> processes out of control, it will still consume all available memory. Just
> not as fast.

If they're leaking memory they might. The spare processes could be coming from anywhere in that ecosystem - in fact Puppet is the least likely source. Before you hit Puppet you've got Apache, Passenger, and Ruby itself all trying to interact.

> They receive kicks, but apply the splay before they act. The net effect here
> is that a large enough splay to prevent the massive jumps we see right now
> (>5 minutes) means that the useful effect of "puppet kick" is reduced to
> zero. Splay is frankly improperly implemented, or rather is not the
> solution for even load on the puppetmaster.

This is a feature you can certainly request (the ability to pass something like --nosplay to the agent), but I'm not sure how it helps you. If the issue is load when all the clients hit simultaneously, how does turning off splay when you kick help you? It'll likely bring your master to its knees if it's struggling now.

In general splay and kick address two different use cases - one for very large scale, the other for targeted deployment. The solution we've introduced as a middle ground is Mcollective, which is not so much a separate technology as an addition to the Puppet ecosystem.

> Whether I want to or not isn't the issue. I don't have the time to invest in
> learning MC because I have mission critical issues I should be working on,
> but I'm not because puppetmaster is falling over and we've tied too many
> tools to it. The question being asked in my shop is no longer "what else can
> puppet do" but "how can we take functionality out of puppet to avoid these
> outages"

I understand about not having the time to learn Mcollective, but it's also reasonable for us to say "you need to do things this way in order to achieve your goal." It's like driving your car in first gear...you may wish to avoid learning how to shift into second, but then the scope of what you can accomplish is limited.

> Imagine if you found an issue with ruby and opened a bug, and they said "oh
> you should learn python -- that's what everyone doing that uses". It's not
> really an answer to the problem. It's entirely orthogonal to the issue at
> hand. Implementing MC to fix puppet load balance is like implementing cron
> to handle puppet balancing, it's just a workaround for the failure. Why not
> simply invest time in implementing cfengine?

Mcollective isn't orthogonal. The more accurate analogy would be if the Ruby developers said "try this library we provide which solves all your issues." If you don't have time to learn that library, it's reasonable for the Ruby developers to do their best to help you while reminding you that your solution is non-optimal.

> --not trying to be nasty, it's actually an honest question. If you have no
> solution to Puppet's issues but to implement a different framework, why not
> implement a different framework that is known and stable without tens of
> thousands of clients? Not trying to bash you, but to point out why your
> current answer isn't going to encourage anyone.

Puppet is known stable at a far larger scale than what you're dealing with, so what we need to do is find a way to get your deployment functioning properly.
You are dealing with lots of moving parts, and it is reasonable when one of those parts is known defective for a company to say "we won't work with you to fix that part, but rather recommend replacing it with another" - particularly when the new part is freely available. Put simply, it isn't worth the engineering effort to try and debug a memory issue with a new Passenger, Apache, and Puppet against a version of Ruby that is ~ 6 years old, when a supported and current version is easily available.

You are absolutely free to operate outside the bounds of our recommended versioning, but it's going to be difficult to find answers and support - not just from us but also from the community - if you do so.

If you'd like to give this a shot with a newer Ruby, I (or others here) will happily assist you in getting things running.

-Eric

----------------------------------------
Bug #11381: puppetmaster death spiral under passenger -- document the needs!
https://projects.puppetlabs.com/issues/11381

Author: Jo Rhett
Status: Needs More Information
Priority: Normal
Assignee: Jo Rhett
Category: passenger
Target version:
Affected Puppet version: 2.6.12
Keywords:
Branch:

Having run a cfengine master server that handled 25k clients, I guess I should feel spoiled. But the apparent system requirements for puppetmaster are phenomenal. With a mere 500 nodes we have a dedicated machine with 4 cores, 8 GB of memory and 6GB of swap, and yet puppetmaster goes into a death spiral daily. There is nothing on this host other than apache, passenger and puppetmaster.
(and nrpe/nagios test to ensure puppet client is running)

This is what top looks like when it happens:

<pre>
top - 01:18:06 up 1 day, 1:53, 2 users, load average: 185.70, 148.74, 77.73
Tasks: 379 total, 181 running, 198 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 99.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 0.1%si, 0.0%st
Mem: 8174508k total, 8132764k used, 41744k free, 524k buffers
Swap: 6094840k total, 6094840k used, 0k free, 19784k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 7938 puppet    18   0  216m 100m  648 R 43.0  1.3   0:02.65 ruby
31786 puppet    19   0  215m 107m 1724 R 34.1  1.3   2:46.71 ruby
  364 root      15   0     0    0    0 S 13.2  0.0   1:21.89 pdflush
 7868 puppet    19   0  217m 102m  648 R 11.4  1.3   0:05.21 ruby
 8028 root      15   0     0    0    0 S 11.4  0.0   0:21.73 pdflush
 7804 puppet    19   0  212m  96m  648 R 11.1  1.2   0:02.38 ruby
 7802 puppet    18   0  243m 131m  840 R  7.4  1.6   0:06.40 ruby
 7692 puppet    19   0  212m  16m  648 R  7.1  0.2   0:06.10 ruby
 7573 puppet    18   0  210m  12m  648 R  6.1  0.2   0:13.12 ruby
 7900 puppet    18   0  225m 111m  648 R  6.1  1.4   0:05.88 ruby
 7926 puppet    19   0  215m 105m  648 R  6.1  1.3   0:03.42 ruby
 7941 puppet    18   0  181m  79m  648 R  6.1  1.0   0:02.68 ruby
 7561 puppet    18   0  200m  21m  648 R  5.8  0.3   0:13.21 ruby
 7792 puppet    18   0  222m 113m  940 R  4.9  1.4   0:11.08 ruby
 8113 root      19   0  102m  896  608 R  4.9  0.0   0:01.40 crond
 7902 puppet    18   0  209m 100m  852 R  4.3  1.3   0:04.42 ruby
 7429 puppet    18   0  207m  25m  648 R  4.0  0.3   0:10.24 ruby
31816 puppet    19   0  225m 117m 1652 R  4.0  1.5   2:28.63 ruby
 7685 puppet    18   0  210m  19m  648 R  3.7  0.2   0:10.95 ruby
 7918 puppet    18   0  215m 101m  648 R  3.7  1.3   0:03.52 ruby
 8121 root      18   0 60476 1144  800 R  3.4  0.0   0:00.73 sshd
31825 puppet    18   0  220m 110m 1652 R  3.4  1.4   2:54.23 ruby
 7417 puppet    19   0  198m  30m  648 R  3.1  0.4   0:10.72 ruby
 7459 puppet    19   0  206m  17m  648 R  3.1  0.2   0:08.91 ruby
 7479 puppet    19   0  199m  17m  648 R  3.1  0.2   0:09.01 ruby
 7570 puppet    18   0  205m  19m  648 R  3.1  0.2   0:14.22 ruby
 7576 puppet    19   0  212m  12m  648 R  3.1  0.2   0:08.61 ruby
 7585 puppet    19   0  207m  18m  648 R  3.1  0.2   0:07.44 ruby
 7589 puppet    19   0  204m  14m  648 R  3.1  0.2   0:07.00 ruby
 7593 puppet    19   0  181m  81m 1548 R  3.1  1.0   0:37.07 ruby
 7620 puppet    19   0  210m  17m  648 R  3.1  0.2   0:07.81 ruby
 7625 puppet    19   0  209m  21m  648 R  3.1  0.3   0:08.22 ruby
 7652 puppet    18   0  164m  10m  648 R  3.1  0.1   0:03.61 ruby
 7656 puppet    19   0  213m  35m  648 R  3.1  0.5   0:18.16 ruby
 7669 puppet    19   0  204m  23m  648 R  3.1  0.3   0:10.32 ruby
 7672 puppet    19   0  207m  14m  648 R  3.1  0.2   0:06.61 ruby
 7676 puppet    20   0  205m  17m  648 R  3.1  0.2   0:07.71 ruby
 7708 puppet    18   0  208m  16m  648 R  3.1  0.2   0:04.46 ruby
 7739 puppet    19   0  221m  14m  648 R  3.1  0.2   0:04.93 ruby
 7743 puppet    19   0  212m  34m  648 R  3.1  0.4   0:04.51 ruby
 7747 puppet    19   0  207m  25m  648 R  3.1  0.3   0:08.15 ruby
 7794 puppet    19   0  213m  41m  648 R  3.1  0.5   0:07.06 ruby
 7842 puppet    18   0  211m 100m  648 R  3.1  1.3   0:06.48 ruby
 7850 puppet    19   0  212m  96m  852 R  3.1  1.2   0:05.51 ruby
 7852 puppet    19   0  212m  95m  648 R  3.1  1.2   0:01.68 ruby
 7855 puppet    19   0  209m  97m  924 R  3.1  1.2   0:10.06 ruby
 7872 puppet    19   0  214m  97m  852 R  3.1  1.2   0:08.38 ruby
</pre>

1. Passenger clients are limited to 20. Where did all these other ruby instances come from? (there is no other ruby code on the system)
2. Why is it willing to spawn until system death? How can I limit this?

CentOS 5.7 with ruby 1.8.5 and all puppet packages from yum.puppetlabs.com

Passenger 3.0.11 at the moment but we first saw this with passenger 2.2 and upgraded without any change in behavior.
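
For anyone following along who wants to check the memory arithmetic debated above, here is a minimal Python sketch, not part of the original report. It counts the ruby processes in pasted top rows and sums their resident size. The function names (`parse_size_kib`, `summarize`) are hypothetical, the column positions assume the top layout shown in this issue, and the three sample rows are copied from the output above.

```python
def parse_size_kib(field):
    """Convert a top(1) size field such as '216m' or '648' to KiB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field)  # bare numbers in this top layout are KiB


def summarize(top_rows, command="ruby"):
    """Return (process count, total RES in KiB) for rows matching command."""
    count = 0
    res_total_kib = 0.0
    for row in top_rows:
        cols = row.split()
        # Assumed layout: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) < 12 or cols[11] != command:
            continue
        count += 1
        res_total_kib += parse_size_kib(cols[5])
    return count, res_total_kib


# Three rows copied from the top output in this issue.
sample = [
    "7938 puppet 18 0 216m 100m 648 R 43.0 1.3 0:02.65 ruby",
    "31786 puppet 19 0 215m 107m 1724 R 34.1 1.3 2:46.71 ruby",
    "364 root 15 0 0 0 0 S 13.2 0.0 1:21.89 pdflush",
]

count, res_kib = summarize(sample)
print(count, "ruby processes,", res_kib / 1024, "MiB resident")

# The figure quoted in the thread: a pool of 20 workers at about 104 MB each.
expected_pool_mb = 20 * 104
print("expected pool footprint:", expected_pool_mb, "MB")
```

Feeding it the full process list instead of the three sample rows gives the actual ruby process count, which can be compared against the Passenger limit of 20, and a resident total that can be compared against the roughly 2 GB that 20 workers of that size would need.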
