Issue #11381 has been updated by Jo Rhett.
Eric Shamow wrote:

> If they're leaking memory they might. The spare processes could be coming
> from anywhere in that ecosystem - in fact Puppet is the least likely source.
> Before you hit Puppet you've got Apache, Passenger, and Ruby itself all
> trying to interact.

Well, they run under the puppet user, so I would expect that they are started by the puppetmaster running under Passenger. Apache wouldn't be starting Ruby processes. But yes, this is what we should be investigating, and what I'd like to get to the bottom of.

> This is a feature you can certainly request (the ability to pass something
> like --nosplay to the agent)

Already did.

> If the issue is load when all the clients hit simultaneously, how does
> turning off splay when you kick help you? It'll likely bring your master to
> its knees if it's struggling now.

We kick at most 25 hosts at a time, and usually 10. What we want is for all 10 to come into a new state very fast. Given that the puppetmaster is running with 15-17 idle processes the vast majority of the time, there's more than enough headroom for that. The problem is not 10 or 20 systems hitting in the same minute -- that works rather well. The problem is 350 systems all hitting in the same minute.

> In general splay and kick address two different use cases - one for very
> large scale, the other for targeted deployment.

Good -- because that is exactly what we are looking for. But right now splay prevents timely targeted deployment. If splay applied to the normal checks but were ignored for puppet kick, then it would meet your own description above.

> The solution we've introduced as a middle ground is Mcollective, which is
> not so much a separate technology as an addition to the Puppet ecosystem.

Great, but I don't have time to learn it while the puppetmaster is falling over and our tools aren't working.
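[Editor's note: for context on the splay behavior being discussed, these are the agent-side settings that control it in Puppet of this era. A minimal sketch only; the values shown are illustrative, not the reporter's actual configuration.]

```ini
# /etc/puppet/puppet.conf -- illustrative values, not the actual deployment
[agent]
    runinterval = 1800   # agent checks in every 30 minutes
    splay       = true   # sleep a random delay before each run
    splaylimit  = 1800   # upper bound on that delay (defaults to runinterval)
```

With `splay = true`, the delay applies to every agent run, which is why a `puppet kick` does not take effect immediately; there is no separate setting to exempt kicks, hence the feature request above.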
Hearing that Puppet isn't intended to work without mco is something that is likely to produce a fairly negative response here: why implement something else that may be equally broken?

> I understand about not having the time to learn Mcollective, but it's also
> reasonable for us to say "you need to do things this way in order to achieve
> your goal." It's like driving your car in first gear...you may wish to avoid
> learning how to shift into second, but then the scope of what you can
> accomplish is limited.

That's not apples to apples. If I buy a car with 6 gears in it, yes, I need to learn to drive the car. If there was something I wasn't doing right in the Puppet deployment, then I need to learn to fix it (and there likely is!). But if I buy a car and learn how to accelerate, only to find that brakes are not included... that's a problem. You're basically telling me that I need something beyond the car, not included in the car-buying package, to make Puppet work at all.

> Mcollective isn't orthogonal. The more accurate analogy would be if the Ruby
> developers said "try this library we provide which solves all your issues."
> If you don't have time to learn that library, it's reasonable for the Ruby
> developers to do their best to help you while reminding you that your
> solution is non-optimal.

I totally agree with this. But I have political and organizational constraints on when I can have the company invest in new technologies, and right now Puppet isn't providing a lot of stable ground to stand on for investing in another technology. There are competing groups within my team who want to use Windows-based technologies, or to not use any such system at all and do it by hand. Saying "oh, this was never intended to work without also that" won't make a strong argument. "This works great" and "this other thing will give us even more" are great arguments. I'd intended to start the mco discussion from that standpoint.
> Puppet is known stable at a far larger scale than what you're dealing with,
> so what we need to do is find a way to get your deployment functioning
> properly.

I am aware of this. This is why I am so baffled at the total silence when I ask the question: why are 350 systems all choosing the same moment to connect? Someone at PL has got to know why this is.

> You are dealing with lots of moving parts, and it is reasonable when one of
> those parts is known defective for a company to say "we won't work with you
> to fix that part, but rather recommend replacing it with another" -
> particularly when the new part is freely available. Put simply, it isn't
> worth the engineering effort to try and debug a memory issue with a new
> Passenger, Apache, and Puppet against a version of Ruby that is ~ 6 years
> old, when a supported and current version is easily available.

I have a constraint that I can only use RPMs from CentOS, EPEL, or specific approved vendors. I got you added to the approved list, but you don't have a binary RPM. I actually don't know RPM building, and nobody else on the team is very strong with it, so I need to build out the environment and build the RPM myself. I expect to spend less than half a day on this (I'm going to build it carefully and document it for future work), but this keeps falling to lower priority than ongoing issues. There were some RPMs built earlier by employees that broke systems badly, so I have to do this in a way that enhances confidence in building our own.

> You are absolutely free to operate outside the bounds of our recommended
> versioning, but it's going to be difficult to find answers and support - not
> just from us but also from the community - if you do so.
> If you'd like to give this a shot with a newer Ruby, I (or others here) will
> happily assist you in getting things running.

I would love to, once I'm done with the current crisis -- none of which I can work on until the puppetmaster is stable.
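[Editor's note: on the synchronized-checkin question above -- one common way to keep hundreds of agents from hitting the master in the same minute is a deterministic per-host splay. The sketch below is illustrative only (it is not Puppet's actual implementation, and the hostnames are made up); it shows how hashing the hostname yields a stable, evenly spread offset within the run interval.]

```ruby
require 'digest/md5'

# Derive a repeatable per-host delay so check-ins spread across the
# run interval instead of all firing at the same moment.
# This is a sketch of the general technique, not Puppet's own code.
def splay_delay(hostname, splaylimit)
  # Hash the hostname so each node gets a different, but stable, offset.
  Digest::MD5.hexdigest(hostname).to_i(16) % splaylimit
end

splaylimit = 1800  # spread runs over a 30-minute interval (illustrative)
%w[web01 web02 db01].each do |host|
  puts "#{host} would sleep #{splay_delay(host, splaylimit)}s before its run"
end
```

Because the offset is a pure function of the hostname, each agent lands in the same slot every interval -- load stays spread out without any coordination through the master.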
I'm trying to find time to build out the RPM infrastructure, but it doesn't get any love within the organization as far as priority. Note: I prefer having our own repo, and on every other OS (BSD, Solaris, RHEL 3, etc.) I've built this out and done it well. I just wasn't using Linux for the last 10 years, so I need to relearn the proper ways and build out documentation and structure for long-term management.

----------------------------------------
Bug #11381: puppetmaster death spiral under passenger -- document the needs!
https://projects.puppetlabs.com/issues/11381

Author: Jo Rhett
Status: Needs More Information
Priority: Normal
Assignee: Jo Rhett
Category: passenger
Target version:
Affected Puppet version: 2.6.12
Keywords:
Branch:

Having run a cfengine master server that handled 25k clients, I guess I should feel spoiled. But the apparent system requirements for puppetmaster are phenomenal. With a mere 500 nodes, we have a dedicated machine with 4 cores, 8 GB of memory, and 6 GB of swap, and yet puppetmaster goes into a death spiral daily. There is nothing on this host other than Apache, Passenger, and puppetmaster
(and an nrpe/nagios test to ensure the puppet client is running).

This is what top looks like when it happens:

<pre>
top - 01:18:06 up 1 day, 1:53, 2 users, load average: 185.70, 148.74, 77.73
Tasks: 379 total, 181 running, 198 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 99.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 0.1%si, 0.0%st
Mem:  8174508k total, 8132764k used,   41744k free,     524k buffers
Swap: 6094840k total, 6094840k used,       0k free,   19784k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 7938 puppet 18  0  216m 100m  648 R 43.0  1.3 0:02.65 ruby
31786 puppet 19  0  215m 107m 1724 R 34.1  1.3 2:46.71 ruby
  364 root   15  0     0    0    0 S 13.2  0.0 1:21.89 pdflush
 7868 puppet 19  0  217m 102m  648 R 11.4  1.3 0:05.21 ruby
 8028 root   15  0     0    0    0 S 11.4  0.0 0:21.73 pdflush
 7804 puppet 19  0  212m  96m  648 R 11.1  1.2 0:02.38 ruby
 7802 puppet 18  0  243m 131m  840 R  7.4  1.6 0:06.40 ruby
 7692 puppet 19  0  212m  16m  648 R  7.1  0.2 0:06.10 ruby
 7573 puppet 18  0  210m  12m  648 R  6.1  0.2 0:13.12 ruby
 7900 puppet 18  0  225m 111m  648 R  6.1  1.4 0:05.88 ruby
 7926 puppet 19  0  215m 105m  648 R  6.1  1.3 0:03.42 ruby
 7941 puppet 18  0  181m  79m  648 R  6.1  1.0 0:02.68 ruby
 7561 puppet 18  0  200m  21m  648 R  5.8  0.3 0:13.21 ruby
 7792 puppet 18  0  222m 113m  940 R  4.9  1.4 0:11.08 ruby
 8113 root   19  0  102m  896  608 R  4.9  0.0 0:01.40 crond
 7902 puppet 18  0  209m 100m  852 R  4.3  1.3 0:04.42 ruby
 7429 puppet 18  0  207m  25m  648 R  4.0  0.3 0:10.24 ruby
31816 puppet 19  0  225m 117m 1652 R  4.0  1.5 2:28.63 ruby
 7685 puppet 18  0  210m  19m  648 R  3.7  0.2 0:10.95 ruby
 7918 puppet 18  0  215m 101m  648 R  3.7  1.3 0:03.52 ruby
 8121 root   18  0 60476 1144  800 R  3.4  0.0 0:00.73 sshd
31825 puppet 18  0  220m 110m 1652 R  3.4  1.4 2:54.23 ruby
 7417 puppet 19  0  198m  30m  648 R  3.1  0.4 0:10.72 ruby
 7459 puppet 19  0  206m  17m  648 R  3.1  0.2 0:08.91 ruby
 7479 puppet 19  0  199m  17m  648 R  3.1  0.2 0:09.01 ruby
 7570 puppet 18  0  205m  19m  648 R  3.1  0.2 0:14.22 ruby
 7576 puppet 19  0  212m  12m  648 R  3.1  0.2 0:08.61 ruby
 7585 puppet 19  0  207m  18m  648 R  3.1  0.2 0:07.44 ruby
 7589 puppet 19  0  204m  14m  648 R  3.1  0.2 0:07.00 ruby
 7593 puppet 19  0  181m  81m 1548 R  3.1  1.0 0:37.07 ruby
 7620 puppet 19  0  210m  17m  648 R  3.1  0.2 0:07.81 ruby
 7625 puppet 19  0  209m  21m  648 R  3.1  0.3 0:08.22 ruby
 7652 puppet 18  0  164m  10m  648 R  3.1  0.1 0:03.61 ruby
 7656 puppet 19  0  213m  35m  648 R  3.1  0.5 0:18.16 ruby
 7669 puppet 19  0  204m  23m  648 R  3.1  0.3 0:10.32 ruby
 7672 puppet 19  0  207m  14m  648 R  3.1  0.2 0:06.61 ruby
 7676 puppet 20  0  205m  17m  648 R  3.1  0.2 0:07.71 ruby
 7708 puppet 18  0  208m  16m  648 R  3.1  0.2 0:04.46 ruby
 7739 puppet 19  0  221m  14m  648 R  3.1  0.2 0:04.93 ruby
 7743 puppet 19  0  212m  34m  648 R  3.1  0.4 0:04.51 ruby
 7747 puppet 19  0  207m  25m  648 R  3.1  0.3 0:08.15 ruby
 7794 puppet 19  0  213m  41m  648 R  3.1  0.5 0:07.06 ruby
 7842 puppet 18  0  211m 100m  648 R  3.1  1.3 0:06.48 ruby
 7850 puppet 19  0  212m  96m  852 R  3.1  1.2 0:05.51 ruby
 7852 puppet 19  0  212m  95m  648 R  3.1  1.2 0:01.68 ruby
 7855 puppet 19  0  209m  97m  924 R  3.1  1.2 0:10.06 ruby
 7872 puppet 19  0  214m  97m  852 R  3.1  1.2 0:08.38 ruby
</pre>

1. Passenger clients are limited to 20. Where did all these other ruby instances come from? (There is no other ruby code on the system.)

2. Why is it willing to spawn until system death? How can I limit this?

CentOS 5.7 with Ruby 1.8.5 and all Puppet packages from yum.puppetlabs.com. Passenger 3.0.11 at the moment, but we first saw this with Passenger 2.2 and upgraded without any change in behavior.

--
You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here: http://projects.puppetlabs.com/my/account

--
You received this message because you are subscribed to the Google Groups "Puppet Bugs" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/puppet-bugs?hl=en.
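[Editor's note: regarding the spawn-limit questions in the report, Passenger's process pool is capped through Apache directives rather than Puppet settings. A hedged sketch follows -- the directive names are real Passenger 3.x options, but the values are illustrative, not a recommendation for this specific host.]

```apache
# Apache vhost / global config -- illustrative values
PassengerMaxPoolSize 20          # hard cap on application processes
PassengerMaxInstancesPerApp 20   # cap per application (one app here)
PassengerPoolIdleTime 300        # reap idle workers after 5 minutes
PassengerMaxRequests 1000        # recycle each worker periodically,
                                 # limiting damage from slow memory leaks
```

Note that `PassengerMaxPoolSize` only governs live workers in the pool; processes that are shutting down (or were detached during a restart) can linger briefly outside that count, which is one possible source of "extra" ruby processes during an overload.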
