On Friday, October 31, 2014 9:50:41 AM UTC-4, Georgi Todorov wrote:
>
> Actually, sometime last night something happened and puppet stopped
> processing requests altogether. Stopping and starting httpd fixed this, but
> this could be just some bug in one of the new versions of software I
> upgraded to. I'll keep monitoring.
>
So, unfortunately, the issue is not fixed :(. For whatever reason, everything
ran great for a day: catalog compiles were taking around 7 seconds and client
runs finished in about 20 seconds - happy days. Then overnight, the catalog
compile times jumped to 20-30 seconds and client runs were taking 200+
seconds. A few hours later, no requests were arriving at the puppet master
at all. Is my HTTP server flaking out?
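(Next time it wedges, one thing I plan to check - a minimal sketch, assuming
the Passenger admin tools that ship with the gem are installed - is whether
Passenger's request queue is backing up:)

    # Show worker processes, per-process request counts, and the
    # global queue depth:
    sudo passenger-status

    # List the requests currently in flight on each worker:
    sudo passenger-status --show=requests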
Running the agent with --trace --evaltrace, and strace-ing the master, it
looks like most of the time is spent stat-ing:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.01    5.743474           9    673606    612864 stat
  7.72    0.534393           7     72102     71510 lstat
  6.76    0.467930       77988         6           wait4
That's a pretty poor "hit" rate - 612864 of the 673606 stats failed, so only
about 61k of them actually found a file...
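(For reference, the summary above came from attaching strace to the master's
worker processes; a minimal sketch - <worker_pid> is a placeholder for an
Apache/Passenger worker:)

    # Attach to a running worker, follow forks, and aggregate
    # per-syscall time/call/error counts; Ctrl-C prints the table.
    sudo strace -c -f -p <worker_pid>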
I've increased the run interval to 1 hour on all clients, and the master
seems to be keeping up for now: catalog compile avg 8 seconds, client run
avg 15 seconds, queue size = 0.
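(For the record, the interval change is just the stock agent setting; a
sketch of the puppet.conf change, assuming the default config location:)

    # /etc/puppet/puppet.conf on each client
    [agent]
        # check in hourly instead of every 30 minutes (the default)
        runinterval = 3600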
Here is what a client run looks like when the server is keeping up:
Notice: Finished catalog run in *11.93* seconds
Changes:
Events:
Resources:
           Total: 522
Time:
      Filebucket: 0.00
            Cron: 0.00
        Schedule: 0.00
         Package: 0.00
         Service: 0.68
            Exec: 1.07
           *File: 1.72*
Config retrieval: 13.35
        Last run: 1415032387
           Total: 16.82
Version:
          Config: 1415031292
          Puppet: 3.7.2
And when the server is just about dead:
Notice: Finished catalog run in 214.21 seconds
Changes:
Events:
Resources:
           Total: 522
Time:
            Cron: 0.00
      Filebucket: 0.00
        Schedule: 0.01
         Package: 0.02
         Service: 1.19
            File: 128.94
        Last run: 1415027092
           Total: 159.21
            Exec: 2.25
Config retrieval: 26.80
Version:
          Config: 1415025705
          Puppet: 3.7.2
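The jump is almost entirely in File (1.72 vs 128.94 seconds), which points at
the metadata/checksum requests the agent makes back to the master for every
sourced file. One way I can try to isolate that - a hedged sketch, assuming
Puppet 3's REST endpoints and the default agent ssldir; the master hostname
and the autofs/auto.home path are only placeholders - is to time a single
file_metadata request by hand:

    # Time one file_metadata request against the master; cert paths
    # are the Puppet 3 agent defaults, and puppetmaster:8140 plus
    # modules/autofs/auto.home are placeholders for real values.
    time curl --silent \
        --cert /var/lib/puppet/ssl/certs/$(hostname -f).pem \
        --key /var/lib/puppet/ssl/private_keys/$(hostname -f).pem \
        --cacert /var/lib/puppet/ssl/certs/ca.pem \
        -H 'Accept: pson' \
        "https://puppetmaster:8140/production/file_metadata/modules/autofs/auto.home"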
Probably 500 of the 522 resources are autofs maps, managed with
https://github.com/pdxcat/puppet-module-autofs/commits/master.
So there is definitely a bottleneck somewhere in the system; the problem is I
can't figure out what it is. Is it disk I/O (iostat doesn't seem to think
so)? Is it CPU (top looks fine)? Memory (ditto)? Is the httpd/Passenger combo
not up to the task, or is the postgres server not keeping up? There are so
many components that it is hard for me to do a proper profile to find where
the bottleneck is. Any ideas?
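(One thing I haven't tried yet: Puppet 3.2 and later ship a built-in profiler
that logs per-phase timings, which might narrow this down - a minimal sketch:)

    # One-off profiled run from a client; PROFILE lines breaking the
    # run down phase by phase then show up in the logs.
    puppet agent --test --profile

    # Or enable it permanently on the master side:
    # /etc/puppet/puppet.conf
    [master]
        profile = true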
So far I've timed the ENC script that pulls the classes for a node - it takes
less than 1 second.
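(A minimal sketch of how I timed it - the script path and node name are
placeholders for my actual ENC:)

    # Run the ENC by hand against a known node and time it.
    time /etc/puppet/enc.rb node01.example.com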
From the messages log, catalog compiles take from 7 seconds to 25 seconds
(worst case, overloaded server).
Anyway, figured I'd share that - unfortunately, Ruby was not the issue. Back
to poking around and testing.