Issue #11381 has been updated by Jo Rhett.
Eric Shamow wrote:

> If they're leaking memory they might. The spare processes could be coming
> from anywhere in that ecosystem - in fact Puppet is the least likely source.
> Before you hit Puppet you've got Apache, Passenger, and Ruby itself all
> trying to interact.

Well, they run under the puppet user, so I would expect that they are started by the puppetmaster running under Passenger. Apache wouldn't be starting Ruby processes. But yes, this is what we should be investigating, and what I'd like to get to the bottom of.

> This is a feature you can certainly request (the ability to pass something
> like --nosplay to the agent)

Already did.

> If the issue is load when all the clients hit simultaneously, how does
> turning off splay when you kick help you? It'll likely bring your master to
> its knees if it's struggling now.

We kick at most 25 hosts at a time, and usually 10. What we want is for all 10 to come into a new state very fast. Given that the puppetmaster is running with 15-17 idle processes the vast majority of the time, there's more than enough headroom for that. The problem is not 10 or 20 systems hitting in the same minute -- that works rather well. The problem is 350 systems all hitting in the same minute.

> In general splay and kick address two different use cases - one for very
> large scale, the other for targeted deployment.

Good -- because that is exactly what we are looking for. But right now splay prevents timely targeted deployment. If splay applied to the normal checks but were ignored for puppet kick, then it would meet your own description above.

> The solution we've introduced as a middle ground is Mcollective, which is
> not so much a separate technology as an addition to the Puppet ecosystem.

Great, but I don't have time to learn it while the puppetmaster is falling over and our tools aren't working.
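[Editor's note: for context on the splay behavior being discussed, these are the agent-side settings that control it in Puppet of this era. A minimal sketch only; the values shown are illustrative, not the reporter's actual configuration.]

```ini
# /etc/puppet/puppet.conf -- illustrative values, not the actual deployment
[agent]
    runinterval = 1800   # agent checks in every 30 minutes
    splay       = true   # sleep a random delay before each run
    splaylimit  = 1800   # upper bound on that delay (defaults to runinterval)
```

With `splay = true`, the delay applies to every agent run, which is why a `puppet kick` does not take effect immediately; there is no separate setting to exempt kicks, hence the feature request above.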
Hearing that Puppet isn't intended to work without mco is something that is likely to produce a fairly negative response here: why implement something else that may be equally broken?

> I understand about not having the time to learn Mcollective, but it's also
> reasonable for us to say "you need to do things this way in order to achieve
> your goal." It's like driving your car in first gear...you may wish to avoid
> learning how to shift into second, but then the scope of what you can
> accomplish is limited.

That's not apples to apples. If I buy a car with 6 gears in it, yes, I need to learn to drive the car. If there was something I wasn't doing right in the Puppet deployment, then I need to learn to fix it (and there likely is!). But if I buy a car and learn how to accelerate, only to find that brakes are not included... that's a problem. You're basically telling me that I need something beyond the car, not included in the car-buying package, to make Puppet work at all.

> Mcollective isn't orthogonal. The more accurate analogy would be if the Ruby
> developers said "try this library we provide which solves all your issues."
> If you don't have time to learn that library, it's reasonable for the Ruby
> developers to do their best to help you while reminding you that your
> solution is non-optimal.

I totally agree with this. But I have political and organizational constraints on when I can have the company invest in new technologies, and right now Puppet isn't providing a lot of stable ground to stand on for investing in another technology. There are competing groups within my team who want to use Windows-based technologies, or to not use any such system at all and do it by hand. Saying "oh, this was never intended to work without also that" won't make a strong argument. "This works great" and "this other thing will give us even more" are great arguments. I'd intended to start the mco discussion from that standpoint.
> Puppet is known stable at a far larger scale than what you're dealing with,
> so what we need to do is find a way to get your deployment functioning
> properly.

I am aware of this. This is why I am so baffled at the total silence when I ask the question: why are 350 systems all choosing the same moment to connect? Someone at PL has got to know why this is.

> You are dealing with lots of moving parts, and it is reasonable when one of
> those parts is known defective for a company to say "we won't work with you
> to fix that part, but rather recommend replacing it with another" -
> particularly when the new part is freely available. Put simply, it isn't
> worth the engineering effort to try and debug a memory issue with a new
> Passenger, Apache, and Puppet against a version of Ruby that is ~ 6 years
> old, when a supported and current version is easily available.

I have a constraint that I can only use RPMs from CentOS, EPEL, or specific approved vendors. I got you added to the approved list, but you don't have a binary RPM. I actually don't know RPM building, and nobody else on the team is very strong with it, so I need to build out the environment and build the RPM myself. I expect to spend less than half a day on this (I'm going to build it carefully and document it for future work), but this keeps falling to lower priority than ongoing issues. There were some RPMs built earlier by employees that broke systems badly, so I have to do this in a way that enhances confidence in building our own.

> You are absolutely free to operate outside the bounds of our recommended
> versioning, but it's going to be difficult to find answers and support - not
> just from us but also from the community - if you do so.
> If you'd like to give this a shot with a newer Ruby, I (or others here) will
> happily assist you in getting things running.

I would love to, once I'm done with the current crisis -- none of which I can work on until the puppetmaster is stable.
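[Editor's note: on the synchronized-checkin question above -- one common way to keep hundreds of agents from hitting the master in the same minute is a deterministic per-host splay. The sketch below is illustrative only (it is not Puppet's actual implementation, and the hostnames are made up); it shows how hashing the hostname yields a stable, evenly spread offset within the run interval.]

```ruby
require 'digest/md5'

# Derive a repeatable per-host delay so check-ins spread across the
# run interval instead of all firing at the same moment.
# This is a sketch of the general technique, not Puppet's own code.
def splay_delay(hostname, splaylimit)
  # Hash the hostname so each node gets a different, but stable, offset.
  Digest::MD5.hexdigest(hostname).to_i(16) % splaylimit
end

splaylimit = 1800  # spread runs over a 30-minute interval (illustrative)
%w[web01 web02 db01].each do |host|
  puts "#{host} would sleep #{splay_delay(host, splaylimit)}s before its run"
end
```

Because the offset is a pure function of the hostname, each agent lands in the same slot every interval -- load stays spread out without any coordination through the master.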
I'm trying to find time to build out the RPM infrastructure, but it doesn't get any love within the organization as far as priority. Note: I prefer having our own repo, and on every other OS (BSD, Solaris, RHEL 3, etc.) I've built this out and done it well. I just wasn't using Linux for the last 10 years, so I need to relearn the proper ways and build out documentation and structure for long-term management.

----------------------------------------
Bug #11381: puppetmaster death spiral under passenger -- document the needs!
https://projects.puppetlabs.com/issues/11381

Author: Jo Rhett
Status: Needs More Information
Priority: Normal
Assignee: Jo Rhett
Category: passenger
Target version:
Affected Puppet version: 2.6.12
Keywords:
Branch:

Having run a cfengine master server that handled 25k clients, I guess I should feel spoiled. But the apparent system requirements for puppetmaster are phenomenal. With a mere 500 nodes, we have a dedicated machine with 4 cores, 8 GB of memory, and 6 GB of swap, and yet puppetmaster goes into a death spiral daily. There is nothing on this host other than Apache, Passenger, and puppetmaster
(and an nrpe/nagios test to ensure the puppet client is running).

This is what top looks like when it happens:

<pre>
top - 01:18:06 up 1 day, 1:53, 2 users, load average: 185.70, 148.74, 77.73
Tasks: 379 total, 181 running, 198 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 99.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 0.1%si, 0.0%st
Mem:  8174508k total, 8132764k used,   41744k free,     524k buffers
Swap: 6094840k total, 6094840k used,       0k free,   19784k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 7938 puppet 18  0  216m 100m  648 R 43.0  1.3 0:02.65 ruby
31786 puppet 19  0  215m 107m 1724 R 34.1  1.3 2:46.71 ruby
  364 root   15  0     0    0    0 S 13.2  0.0 1:21.89 pdflush
 7868 puppet 19  0  217m 102m  648 R 11.4  1.3 0:05.21 ruby
 8028 root   15  0     0    0    0 S 11.4  0.0 0:21.73 pdflush
 7804 puppet 19  0  212m  96m  648 R 11.1  1.2 0:02.38 ruby
 7802 puppet 18  0  243m 131m  840 R  7.4  1.6 0:06.40 ruby
 7692 puppet 19  0  212m  16m  648 R  7.1  0.2 0:06.10 ruby
 7573 puppet 18  0  210m  12m  648 R  6.1  0.2 0:13.12 ruby
 7900 puppet 18  0  225m 111m  648 R  6.1  1.4 0:05.88 ruby
 7926 puppet 19  0  215m 105m  648 R  6.1  1.3 0:03.42 ruby
 7941 puppet 18  0  181m  79m  648 R  6.1  1.0 0:02.68 ruby
 7561 puppet 18  0  200m  21m  648 R  5.8  0.3 0:13.21 ruby
 7792 puppet 18  0  222m 113m  940 R  4.9  1.4 0:11.08 ruby
 8113 root   19  0  102m  896  608 R  4.9  0.0 0:01.40 crond
 7902 puppet 18  0  209m 100m  852 R  4.3  1.3 0:04.42 ruby
 7429 puppet 18  0  207m  25m  648 R  4.0  0.3 0:10.24 ruby
31816 puppet 19  0  225m 117m 1652 R  4.0  1.5 2:28.63 ruby
 7685 puppet 18  0  210m  19m  648 R  3.7  0.2 0:10.95 ruby
 7918 puppet 18  0  215m 101m  648 R  3.7  1.3 0:03.52 ruby
 8121 root   18  0 60476 1144  800 R  3.4  0.0 0:00.73 sshd
31825 puppet 18  0  220m 110m 1652 R  3.4  1.4 2:54.23 ruby
 7417 puppet 19  0  198m  30m  648 R  3.1  0.4 0:10.72 ruby
 7459 puppet 19  0  206m  17m  648 R  3.1  0.2 0:08.91 ruby
 7479 puppet 19  0  199m  17m  648 R  3.1  0.2 0:09.01 ruby
 7570 puppet 18  0  205m  19m  648 R  3.1  0.2 0:14.22 ruby
 7576 puppet 19  0  212m  12m  648 R  3.1  0.2 0:08.61 ruby
 7585 puppet 19  0  207m  18m  648 R  3.1  0.2 0:07.44 ruby
 7589 puppet 19  0  204m  14m  648 R  3.1  0.2 0:07.00 ruby
 7593 puppet 19  0  181m  81m 1548 R  3.1  1.0 0:37.07 ruby
 7620 puppet 19  0  210m  17m  648 R  3.1  0.2 0:07.81 ruby
 7625 puppet 19  0  209m  21m  648 R  3.1  0.3 0:08.22 ruby
 7652 puppet 18  0  164m  10m  648 R  3.1  0.1 0:03.61 ruby
 7656 puppet 19  0  213m  35m  648 R  3.1  0.5 0:18.16 ruby
 7669 puppet 19  0  204m  23m  648 R  3.1  0.3 0:10.32 ruby
 7672 puppet 19  0  207m  14m  648 R  3.1  0.2 0:06.61 ruby
 7676 puppet 20  0  205m  17m  648 R  3.1  0.2 0:07.71 ruby
 7708 puppet 18  0  208m  16m  648 R  3.1  0.2 0:04.46 ruby
 7739 puppet 19  0  221m  14m  648 R  3.1  0.2 0:04.93 ruby
 7743 puppet 19  0  212m  34m  648 R  3.1  0.4 0:04.51 ruby
 7747 puppet 19  0  207m  25m  648 R  3.1  0.3 0:08.15 ruby
 7794 puppet 19  0  213m  41m  648 R  3.1  0.5 0:07.06 ruby
 7842 puppet 18  0  211m 100m  648 R  3.1  1.3 0:06.48 ruby
 7850 puppet 19  0  212m  96m  852 R  3.1  1.2 0:05.51 ruby
 7852 puppet 19  0  212m  95m  648 R  3.1  1.2 0:01.68 ruby
 7855 puppet 19  0  209m  97m  924 R  3.1  1.2 0:10.06 ruby
 7872 puppet 19  0  214m  97m  852 R  3.1  1.2 0:08.38 ruby
</pre>

1. Passenger clients are limited to 20. Where did all these other ruby instances come from? (There is no other ruby code on the system.)

2. Why is it willing to spawn until system death? How can I limit this?

CentOS 5.7 with Ruby 1.8.5 and all Puppet packages from yum.puppetlabs.com. Passenger 3.0.11 at the moment, but we first saw this with Passenger 2.2 and upgraded without any change in behavior.

--
You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here: http://projects.puppetlabs.com/my/account

--
You received this message because you are subscribed to the Google Groups "Puppet Bugs" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/puppet-bugs?hl=en.
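[Editor's note: regarding the spawn-limit questions in the report, Passenger's process pool is capped through Apache directives rather than Puppet settings. A hedged sketch follows -- the directive names are real Passenger 3.x options, but the values are illustrative, not a recommendation for this specific host.]

```apache
# Apache vhost / global config -- illustrative values
PassengerMaxPoolSize 20          # hard cap on application processes
PassengerMaxInstancesPerApp 20   # cap per application (one app here)
PassengerPoolIdleTime 300        # reap idle workers after 5 minutes
PassengerMaxRequests 1000        # recycle each worker periodically,
                                 # limiting damage from slow memory leaks
```

Note that `PassengerMaxPoolSize` only governs live workers in the pool; processes that are shutting down (or were detached during a restart) can linger briefly outside that count, which is one possible source of "extra" ruby processes during an overload.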
