[Puppet - Bug #11381] puppetmaster death spiral under passenger -- document the needs!

tickets Wed, 14 Dec 2011 07:11:30 -0800

Issue #11381 has been updated by Eric Shamow.


Hi Jo,

Jo Rhett wrote:

> 20 processes that are 104 MB in size don't consume 8GB of memory.  The exact 
> size of the ruby processes is a known factor here. Yes, they might be bigger 
> than they would in 1.8.7, but that's not the nature of the problem.  The 
> question is where all these spare processes are coming from.
> 
> Imagine that they are half the size (I doubt it) in 1.8.7 -- if ruby spawns 
> processes out of control, it will still consume all available memory. Just 
> not as fast.

If they're leaking memory they might.  The spare processes could be coming from 
anywhere in that ecosystem - in fact Puppet is the least likely source.  Before 
you hit Puppet you've got Apache, Passenger, and Ruby itself all trying to 
interact.
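
As a rough check of the arithmetic in the quoted paragraph, the sketch below uses only figures from the ticket (the 104 MB per-process size, the pool limit of 20, and the Mem/Swap totals from the top output). The helper names are hypothetical, and the estimate ignores pages shared between processes, so it is an approximation rather than a measurement.

```python
# Figures copied from the ticket; nothing here is measured independently.
MEM_TOTAL_KB = 8174508     # "Mem: 8174508k total" in the top output
SWAP_TOTAL_KB = 6094840    # "Swap: 6094840k total" in the top output
PER_PROCESS_MB = 104       # per-process size quoted by the reporter
POOL_LIMIT = 20            # Passenger limit stated in the ticket


def pool_footprint_mb(processes, per_process_mb):
    """Memory used if exactly `processes` workers exist (sharing ignored)."""
    return processes * per_process_mb


def processes_to_fill(total_kb, per_process_mb):
    """Smallest number of workers whose combined size reaches total_kb."""
    per_process_kb = per_process_mb * 1024
    return -(-total_kb // per_process_kb)  # ceiling division


if __name__ == "__main__":
    print("expected pool footprint (MB):",
          pool_footprint_mb(POOL_LIMIT, PER_PROCESS_MB))
    print("workers needed to fill RAM:",
          processes_to_fill(MEM_TOTAL_KB, PER_PROCESS_MB))
    print("workers needed to fill RAM + swap:",
          processes_to_fill(MEM_TOTAL_KB + SWAP_TOTAL_KB, PER_PROCESS_MB))
```

Under these assumptions a pool of 20 accounts for roughly 2 GB, so exhausting both RAM and swap implies far more workers than the configured limit, growth in per-process size, or both.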

> They receive kicks, but apply the splay before they act.  The net effect here 
> is that a large enough splay to prevent the massive jumps we see right now 
> (>5 minutes) means that the useful effect of "puppet kick" is reduced to 
> zero.  Splay is frankly improperly implemented, or rather is not the 
> solution for even load on the puppetmaster.

This is a feature you can certainly request (the ability to pass something like 
--nosplay to the agent), but I'm not sure how it helps you.  If the issue is 
load when all the clients hit simultaneously, how does turning off splay when 
you kick help you?  It'll likely bring your master to its knees if it's 
struggling now.
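
To illustrate the trade-off, here is a minimal simulation sketch. It assumes splay behaves as a uniformly random delay between zero and a configured limit before each agent starts its run; the function name, the 500-agent count (taken from the ticket) and the 30-minute limit are illustrative choices, not Puppet internals.

```python
import random
from collections import Counter


def peak_starts_per_minute(agents, splay_seconds, seed=0):
    """Return the largest number of agent runs starting in any one minute.

    Each agent is assumed to wait a uniformly random delay in
    [0, splay_seconds] before contacting the master; 0 means no splay.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    starts = Counter()
    for _ in range(agents):
        delay = rng.uniform(0, splay_seconds) if splay_seconds > 0 else 0.0
        starts[int(delay // 60)] += 1  # bucket start times by minute
    return max(starts.values())


if __name__ == "__main__":
    # Without splay, every kicked agent hits the master in the same minute.
    print("no splay:", peak_starts_per_minute(500, 0))
    # With a 30-minute splay the same agents are spread across many minutes.
    print("30 min splay:", peak_starts_per_minute(500, 30 * 60))
```

With no splay the peak equals the number of agents kicked, which is the situation described above; a longer splay lowers the peak at the cost of delaying when the kick takes effect.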

In general, splay and kick address two different use cases - one for very large 
scale, the other for targeted deployment.  The solution we've introduced as a 
middle ground is MCollective, which is not so much a separate technology as an 
addition to the Puppet ecosystem.

> Whether I want to or not isn't the issue.  I don't have the time to invest in 
> learning MC because I have mission critical issues I should be working on, 
> but I'm not because puppetmaster is falling over and we've tied too many 
> tools to it.  The question being asked in my shop is no longer "what else can 
> puppet do" but "how can we take functionality out of puppet to avoid these 
> outages"

I understand not having the time to learn MCollective, but it's also 
reasonable for us to say "you need to do things this way in order to achieve 
your goal."  It's like driving your car in first gear: you may wish to avoid 
learning how to shift into second, but then the scope of what you can 
accomplish is limited.

> Imagine if you found an issue with ruby and opened a bug, and they said "oh 
> you should learn python -- that's what everyone doing that uses".  It's not 
> really an answer to the problem.  It's entirely orthogonal to the issue at 
> hand.  Implementing MC to fix puppet load balance is like implementing cron 
> to handle puppet balancing, it's just a workaround for the failure.  Why not 
> simply invest time in implementing cfengine?

MCollective isn't orthogonal.  The more accurate analogy would be if the Ruby 
developers said "try this library we provide which solves all your issues."  If 
you don't have time to learn that library, it's reasonable for the Ruby 
developers to do their best to help you while reminding you that your solution 
is non-optimal.

> --not trying to be nasty, it's actually an honest question.  If you have no 
> solution to Puppet's issues but to implement a different framework, why not 
> implement a different framework that is known and stable with tens of 
> thousands of clients?  Not trying to bash you, but to point out why your 
> current answer isn't going to encourage anyone.

Puppet is known stable at a far larger scale than what you're dealing with, so 
what we need to do is find a way to get your deployment functioning properly.

You are dealing with lots of moving parts, and it is reasonable when one of 
those parts is known defective for a company to say "we won't work with you to 
fix that part, but rather recommend replacing it with another" - particularly 
when the new part is freely available.  Put simply, it isn't worth the 
engineering effort to try to debug a memory issue with a new Passenger, 
Apache, and Puppet against a version of Ruby that is ~6 years old, when a 
supported and current version is easily available.

You are absolutely free to operate outside the bounds of our recommended 
versioning, but it's going to be difficult to find answers and support - not 
just from us but also from the community - if you do so.

If you'd like to give this a shot with a newer Ruby, I (or others here) will 
happily assist you in getting things running.

-Eric
----------------------------------------
Bug #11381: puppetmaster death spiral under passenger -- document the needs!
https://projects.puppetlabs.com/issues/11381

Author: Jo Rhett
Status: Needs More Information
Priority: Normal
Assignee: Jo Rhett
Category: passenger
Target version: 
Affected Puppet version: 2.6.12
Keywords: 
Branch: 


Having run a cfengine master server that handled 25k clients, I guess I should 
feel spoiled.  But the apparent system requirements for puppetmaster are 
phenomenal.  With a mere 500 nodes we have a dedicated machine with 4 cores, 8 
GB of memory and 6GB of swap, and yet puppetmaster goes into a death spiral 
daily.  There is nothing on this host other than Apache, Passenger and 
puppetmaster (plus an NRPE/Nagios check to ensure the puppet client is running).

This is what top looks like when it happens:

<pre>
top - 01:18:06 up 1 day,  1:53,  2 users,  load average: 185.70, 148.74, 77.73
Tasks: 379 total, 181 running, 198 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 99.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:   8174508k total,  8132764k used,    41744k free,      524k buffers
Swap:  6094840k total,  6094840k used,        0k free,    19784k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7938 puppet    18   0  216m 100m  648 R 43.0  1.3   0:02.65 ruby
31786 puppet    19   0  215m 107m 1724 R 34.1  1.3   2:46.71 ruby
  364 root      15   0     0    0    0 S 13.2  0.0   1:21.89 pdflush
 7868 puppet    19   0  217m 102m  648 R 11.4  1.3   0:05.21 ruby
 8028 root      15   0     0    0    0 S 11.4  0.0   0:21.73 pdflush
 7804 puppet    19   0  212m  96m  648 R 11.1  1.2   0:02.38 ruby
 7802 puppet    18   0  243m 131m  840 R  7.4  1.6   0:06.40 ruby
 7692 puppet    19   0  212m  16m  648 R  7.1  0.2   0:06.10 ruby
 7573 puppet    18   0  210m  12m  648 R  6.1  0.2   0:13.12 ruby
 7900 puppet    18   0  225m 111m  648 R  6.1  1.4   0:05.88 ruby
 7926 puppet    19   0  215m 105m  648 R  6.1  1.3   0:03.42 ruby
 7941 puppet    18   0  181m  79m  648 R  6.1  1.0   0:02.68 ruby
 7561 puppet    18   0  200m  21m  648 R  5.8  0.3   0:13.21 ruby
 7792 puppet    18   0  222m 113m  940 R  4.9  1.4   0:11.08 ruby
 8113 root      19   0  102m  896  608 R  4.9  0.0   0:01.40 crond
 7902 puppet    18   0  209m 100m  852 R  4.3  1.3   0:04.42 ruby
 7429 puppet    18   0  207m  25m  648 R  4.0  0.3   0:10.24 ruby
31816 puppet    19   0  225m 117m 1652 R  4.0  1.5   2:28.63 ruby
 7685 puppet    18   0  210m  19m  648 R  3.7  0.2   0:10.95 ruby
 7918 puppet    18   0  215m 101m  648 R  3.7  1.3   0:03.52 ruby
 8121 root      18   0 60476 1144  800 R  3.4  0.0   0:00.73 sshd
31825 puppet    18   0  220m 110m 1652 R  3.4  1.4   2:54.23 ruby
 7417 puppet    19   0  198m  30m  648 R  3.1  0.4   0:10.72 ruby
 7459 puppet    19   0  206m  17m  648 R  3.1  0.2   0:08.91 ruby
 7479 puppet    19   0  199m  17m  648 R  3.1  0.2   0:09.01 ruby
 7570 puppet    18   0  205m  19m  648 R  3.1  0.2   0:14.22 ruby
 7576 puppet    19   0  212m  12m  648 R  3.1  0.2   0:08.61 ruby
 7585 puppet    19   0  207m  18m  648 R  3.1  0.2   0:07.44 ruby
 7589 puppet    19   0  204m  14m  648 R  3.1  0.2   0:07.00 ruby
 7593 puppet    19   0  181m  81m 1548 R  3.1  1.0   0:37.07 ruby
 7620 puppet    19   0  210m  17m  648 R  3.1  0.2   0:07.81 ruby
 7625 puppet    19   0  209m  21m  648 R  3.1  0.3   0:08.22 ruby
 7652 puppet    18   0  164m  10m  648 R  3.1  0.1   0:03.61 ruby
 7656 puppet    19   0  213m  35m  648 R  3.1  0.5   0:18.16 ruby
 7669 puppet    19   0  204m  23m  648 R  3.1  0.3   0:10.32 ruby
 7672 puppet    19   0  207m  14m  648 R  3.1  0.2   0:06.61 ruby
 7676 puppet    20   0  205m  17m  648 R  3.1  0.2   0:07.71 ruby
 7708 puppet    18   0  208m  16m  648 R  3.1  0.2   0:04.46 ruby
 7739 puppet    19   0  221m  14m  648 R  3.1  0.2   0:04.93 ruby
 7743 puppet    19   0  212m  34m  648 R  3.1  0.4   0:04.51 ruby
 7747 puppet    19   0  207m  25m  648 R  3.1  0.3   0:08.15 ruby
 7794 puppet    19   0  213m  41m  648 R  3.1  0.5   0:07.06 ruby
 7842 puppet    18   0  211m 100m  648 R  3.1  1.3   0:06.48 ruby
 7850 puppet    19   0  212m  96m  852 R  3.1  1.2   0:05.51 ruby
 7852 puppet    19   0  212m  95m  648 R  3.1  1.2   0:01.68 ruby
 7855 puppet    19   0  209m  97m  924 R  3.1  1.2   0:10.06 ruby
 7872 puppet    19   0  214m  97m  852 R  3.1  1.2   0:08.38 ruby
</pre>

1. Passenger clients are limited to 20.  Where did all these other ruby 
instances come from?  (there is no other ruby code on the system)

2. Why is it willing to spawn until system death?  How can I limit this?

CentOS 5.7 with Ruby 1.8.5 and all Puppet packages from yum.puppetlabs.com.
Passenger 3.0.11 at the moment, but we first saw this with Passenger 2.2 and 
upgraded without any change in behavior.
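
One way to quantify question 1 is to count the ruby workers in a captured top listing and compare the total against the Passenger limit of 20. The following is a minimal sketch that assumes the 12-column layout shown in the listing above; the function names are hypothetical and the four sample rows are copied from that listing.

```python
def parse_size_kb(field):
    """Convert a top memory field such as '100m' or '1144' to KiB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return int(float(field[:-1]) * units[suffix])
    return int(field)  # top prints unsuffixed values in KiB


def summarize(top_rows, user="puppet", command="ruby"):
    """Return (process count, total RES in KiB) for matching rows."""
    count = 0
    res_kb = 0
    for row in top_rows:
        cols = row.split()
        if len(cols) < 12:
            continue  # blank or truncated line
        # Columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if cols[1] == user and cols[11] == command:
            count += 1
            res_kb += parse_size_kb(cols[5])
    return count, res_kb


# Four rows copied from the listing in this ticket.
SAMPLE_ROWS = [
    " 7938 puppet    18   0  216m 100m  648 R 43.0  1.3   0:02.65 ruby",
    "31786 puppet    19   0  215m 107m 1724 R 34.1  1.3   2:46.71 ruby",
    "  364 root      15   0     0    0    0 S 13.2  0.0   1:21.89 pdflush",
    " 8121 root      18   0 60476 1144  800 R  3.4  0.0   0:00.73 sshd",
]

if __name__ == "__main__":
    workers, total_kb = summarize(SAMPLE_ROWS)
    print(workers, "ruby workers,", total_kb // 1024, "MB resident")
```

Run over a complete capture rather than this sample, the count shows how far the number of ruby processes exceeds the configured pool size at the moment of the spiral.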


