I tried removing the cert from the puppet master for one of the three systems that can't get catalogs, and removing the entire /var/lib/puppet directory on the client, but got the exact same response. After getting a new cert and signing it, the client accepted the cert and then hung waiting for its catalog. The process on the puppetmaster hung as well.
Anything else I could test, check, clear…? --debug --trace on the client simply shows me a timeout, no extra detail.

On Dec 2, 2011, at 3:32 PM, Jo Rhett wrote:
> I am also now pretty certain that this issue (ticket #11140) is tied directly
> to 3 systems (in ticket #11143) which can't get catalogs. I believe their
> attempts to get a catalog produce a hung server. 3 servers every 30 minutes
> means that in just over 3 hours I have 20 hung puppetmasters, and the queue
> goes out of control.
>
> I would deeply appreciate some information on how to diagnose the catalog
> failures and related puppetmaster hangs.
>
> On Dec 2, 2011, at 3:09 PM, Jo Rhett wrote:
>> Hm, you know I don't think that it's a sudden lock of all 20 passenger
>> clients. I think it's a slow lockup of various puppet sessions until all 20
>> are locked. Here's an example: every one of the "active" sessions below
>> with an uptime longer than 30 minutes has had the same "processed" number
>> for more than 30 minutes at this time. So in theory, they've been
>> processing the same session for more than 30 minutes. Somehow, I don't
>> think so. I think those sessions are locked up. And what is happening is
>> that eventually all 20 processes are hung and we are dead in the water.
>>
>> Fri Dec 2 23:05:59 UTC 2011
>> ----------- General information -----------
>> max = 20
>> count = 18
>> active = 12
>> inactive = 6
>> Waiting on global queue: 0
>>
>> ----------- Domains -----------
>> /etc/puppet/rack:
>> PID: 21021   Sessions: 0   Processed: 362   Uptime: 5m 37s
>> PID: 21005   Sessions: 0   Processed: 537   Uptime: 5m 38s
>> PID: 21555   Sessions: 0   Processed: 69   Uptime: 30s
>> PID: 21571   Sessions: 0   Processed: 62   Uptime: 29s
>> PID: 20989   Sessions: 0   Processed: 209   Uptime: 5m 39s
>> PID: 20968   Sessions: 0   Processed: 157   Uptime: 5m 41s
>> PID: 9221   Sessions: 1   Processed: 903   Uptime: 2h 5m 55s
>> PID: 9340   Sessions: 1   Processed: 764   Uptime: 2h 4m 58s
>> PID: 10379   Sessions: 1   Processed: 568   Uptime: 1h 57m 37s
>> PID: 11847   Sessions: 1   Processed: 712   Uptime: 1h 41m 13s
>> PID: 11686   Sessions: 1   Processed: 314   Uptime: 1h 41m 19s
>> PID: 10845   Sessions: 1   Processed: 511   Uptime: 1h 48m 52s
>> PID: 11650   Sessions: 1   Processed: 747   Uptime: 1h 41m 21s
>> PID: 14967   Sessions: 1   Processed: 84   Uptime: 1h 8m 28s
>> PID: 17605   Sessions: 1   Processed: 497   Uptime: 44m 41s
>> PID: 20342   Sessions: 1   Processed: 0   Uptime: 13m 14s
>> PID: 20358   Sessions: 1   Processed: 54   Uptime: 13m 13s
>> PID: 18098   Sessions: 1   Processed: 854   Uptime: 35m 46s
>>
>> On Dec 2, 2011, at 2:22 PM, Jo Rhett wrote:
>>
>>> On Dec 2, 2011, at 1:30 PM, Nigel Kersten wrote:
>>>> On Fri, Dec 2, 2011 at 1:03 PM, Jo Rhett <[email protected]> wrote:
>>>> Okay, this has happened again. Puppet master stopped logging catalog
>>>> compiles, every server stopped returning results and the global queue went
>>>> quickly through the roof in like 9 minutes. It appears puppet master is
>>>> stopping dead in its tracks without logging any errors.
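The stalled-worker reasoning in the 3:09 PM message (a worker holding a session whose "Processed" count has not moved between two looks at passenger-status) can be checked mechanically. The following is only a minimal sketch: it assumes the `PID: … Sessions: … Processed: … Uptime: …` line layout shown in this thread, the names `parse_status` and `stalled` are made up for illustration, and the second snapshot in the example is hypothetical data, not output from the affected server.

```python
import re
from typing import Dict, List, NamedTuple


class Worker(NamedTuple):
    pid: int
    sessions: int
    processed: int
    uptime_s: int


# Matches one worker line as it appears in the passenger-status output quoted
# in this thread; any leading quote markers are simply skipped.
LINE = re.compile(
    r"PID:\s*(\d+)\s+Sessions:\s*(\d+)\s+Processed:\s*(\d+)\s+"
    r"Uptime:\s*((?:\d+[hms]\s*)+)"
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_status(text: str) -> Dict[int, Worker]:
    """Return {pid: Worker} for every worker line found in the text."""
    workers = {}
    for m in LINE.finditer(text):
        pid, sessions, processed = (int(g) for g in m.group(1, 2, 3))
        uptime = sum(
            int(n) * UNIT_SECONDS[u]
            for n, u in re.findall(r"(\d+)([hms])", m.group(4))
        )
        workers[pid] = Worker(pid, sessions, processed, uptime)
    return workers


def stalled(before: Dict[int, Worker], after: Dict[int, Worker]) -> List[int]:
    """PIDs present in both snapshots that hold a session but made no progress."""
    return sorted(
        pid
        for pid, w in after.items()
        if pid in before
        and w.sessions >= 1
        and w.processed == before[pid].processed
    )


if __name__ == "__main__":
    # First snapshot: three lines copied from the thread.
    first = """
    PID: 21021 Sessions: 0 Processed: 362 Uptime: 5m 37s
    PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 5m 55s
    PID: 18098 Sessions: 1 Processed: 854 Uptime: 35m 46s
    """
    # Second snapshot: HYPOTHETICAL values, invented for this example only.
    second = """
    PID: 21021 Sessions: 0 Processed: 911 Uptime: 35m 37s
    PID: 9221 Sessions: 1 Processed: 903 Uptime: 2h 35m 55s
    PID: 18098 Sessions: 1 Processed: 990 Uptime: 1h 5m 46s
    """
    print(stalled(parse_status(first), parse_status(second)))
```

The caller decides how far apart the two snapshots are taken; the 30-minute figure from the message is one reasonable choice, since a healthy worker here compiles a catalog in a few seconds.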
>>>>
>>>> A really quick test would be to start a webrick puppetmaster on an
>>>> alternate port with the same configuration file in debug mode and then
>>>> run puppet against it to see if there's a problem at that level,
>>>>
>>>> (on master)
>>>> puppet master --no-daemonize --verbose --debug --masterport 9140 (for example)
>>>>
>>>> (on an agent)
>>>> puppet agent --test --masterport 9140
>>>
>>> This works perfectly fine.
>>>
>>>> If that doesn't show anything, let us know whether you're running Apache
>>>> prefork or worker, and your relevant pool regulation settings like:
>>>>
>>>> StartServers
>>>> MinSpareServers
>>>> MaxSpareServers
>>>> ServerLimit
>>>> MaxClients
>>>> MaxRequestsPerChild
>>>
>>> prefork with the following settings:
>>>
>>> StartServers 8
>>> MinSpareServers 5
>>> MaxSpareServers 20
>>> ServerLimit 256
>>> MaxClients 256
>>> MaxRequestsPerChild 4000
>>>
>>>> # passenger-status
>>>> ----------- General information -----------
>>>> max = 20
>>>> count = 20
>>>> active = 20
>>>> inactive = 0
>>>> Waiting on global queue: 209
>>>>
>>>> ----------- Domains -----------
>>>> /etc/puppet/rack:
>>>> PID: 25783   Sessions: 1   Processed: 329   Uptime: 2h 52m 7s
>>>> PID: 25831   Sessions: 1   Processed: 4   Uptime: 2h 52m 5s
>>>> PID: 28517   Sessions: 1   Processed: 6   Uptime: 2h 22m 0s
>>>> PID: 25802   Sessions: 1   Processed: 714   Uptime: 2h 52m 6s
>>>> PID: 30905   Sessions: 1   Processed: 13   Uptime: 1h 50m 27s
>>>> PID: 25864   Sessions: 1   Processed: 709   Uptime: 2h 52m 4s
>>>> PID: 31028   Sessions: 1   Processed: 347   Uptime: 1h 50m 21s
>>>> PID: 28944   Sessions: 1   Processed: 377   Uptime: 2h 21m 50s
>>>> PID: 31090   Sessions: 1   Processed: 266   Uptime: 1h 50m 18s
>>>> PID: 577   Sessions: 1   Processed: 400   Uptime: 1h 27m 27s
>>>> PID: 418   Sessions: 1   Processed: 647   Uptime: 1h 28m 2s
>>>> PID: 1247   Sessions: 1   Processed: 133   Uptime: 1h 19m 3s
>>>> PID: 1474   Sessions: 1   Processed: 52   Uptime: 1h 18m 9s
>>>> PID: 594   Sessions: 1   Processed: 378   Uptime: 1h 27m 26s
>>>> PID: 4706   Sessions: 1   Processed: 414   Uptime: 48m 5s
>>>> PID: 4775   Sessions: 1   Processed: 218   Uptime: 47m 28s
>>>> PID: 4854   Sessions: 1   Processed: 584   Uptime: 47m 23s
>>>> PID: 7774   Sessions: 1   Processed: 165   Uptime: 14m 27s
>>>> PID: 7902   Sessions: 1   Processed: 44   Uptime: 13m 44s
>>>> PID: 8149   Sessions: 1   Processed: 541   Uptime: 11m 21s
>>>>
>>>>
>>>> On Dec 2, 2011, at 10:58 AM, Jo Rhett wrote:
>>>>> I came in this morning to find all the servers all locked up solid:
>>>>>
>>>>> # passenger-status
>>>>> ----------- General information -----------
>>>>> max = 20
>>>>> count = 20
>>>>> active = 20
>>>>> inactive = 0
>>>>> Waiting on global queue: 236
>>>>>
>>>>> ----------- Domains -----------
>>>>> /etc/puppet/rack:
>>>>> PID: 2720   Sessions: 1   Processed: 939   Uptime: 9h 22m 18s
>>>>> PID: 1615   Sessions: 1   Processed: 947   Uptime: 9h 23m 14s
>>>>> PID: 1596   Sessions: 1   Processed: 607   Uptime: 9h 23m 15s
>>>>> PID: 1722   Sessions: 1   Processed: 953   Uptime: 9h 23m 9s
>>>>> PID: 2218   Sessions: 1   Processed: 378   Uptime: 9h 22m 43s
>>>>> PID: 4286   Sessions: 1   Processed: 178   Uptime: 8h 50m 58s
>>>>> PID: 5749   Sessions: 1   Processed: 708   Uptime: 8h 20m 20s
>>>>> PID: 4253   Sessions: 1   Processed: 820   Uptime: 8h 51m 1s
>>>>> PID: 5624   Sessions: 1   Processed: 126   Uptime: 8h 20m 24s
>>>>> PID: 7328   Sessions: 1   Processed: 811   Uptime: 7h 49m 17s
>>>>> PID: 7274   Sessions: 1   Processed: 984   Uptime: 7h 49m 20s
>>>>> PID: 8761   Sessions: 1   Processed: 85   Uptime: 7h 18m 50s
>>>>> PID: 9135   Sessions: 1   Processed: 907   Uptime: 7h 16m 27s
>>>>> PID: 8777   Sessions: 1   Processed: 342   Uptime: 7h 18m 49s
>>>>> PID: 10508   Sessions: 1   Processed: 51   Uptime: 6h 47m 6s
>>>>> PID: 10853   Sessions: 1   Processed: 603   Uptime: 6h 43m 9s
>>>>> PID: 10620   Sessions: 1   Processed: 939   Uptime: 6h 45m 52s
>>>>> PID: 11438   Sessions: 1   Processed: 870   Uptime: 6h 30m 8s
>>>>> PID: 12582   Sessions: 1   Processed: 448   Uptime: 6h 9m 59s
>>>>> PID: 12670   Sessions: 1   Processed: 400   Uptime: 6h 8m 46s
>>>>>
>>>>> For comparison, most of our server processes recycle within 20 minutes
>>>>> normally, as they hit 1000 really fast.
>>>>>
>>>>> # you probably want to tune these settings
>>>>> PassengerHighPerformance on
>>>>> PassengerUseGlobalQueue on
>>>>> PassengerMaxPoolSize 20
>>>>> PassengerPoolIdleTime 1800
>>>>> PassengerMaxRequests 1000
>>>>> #PassengerStatThrottleRate 120
>>>>> RackAutoDetect Off
>>>>> RailsAutoDetect Off
>>>>>
>>>>> There is nothing useful in the system logs. They just stopped:
>>>>>
>>>>> Dec 2 12:06:34 axxats003 puppet-master[12670]: Compiled catalog for axxamx001.sjc.company.com in environment production in 1.76 seconds
>>>>> Dec 2 12:06:37 axxats003 puppet-master[12670]: Compiled catalog for axxatn016.sjc.company.com in environment production in 1.64 seconds
>>>>> Dec 2 12:06:40 axxats003 puppet-master[12670]: Compiled catalog for axaafc001.company.com in environment production in 1.70 seconds
>>>>> Dec 2 14:10:02 axxats003 puppet-agent[16965]: Reopening log files
>>>>> Dec 2 14:10:02 axxats003 puppet-agent[16965]: Starting Puppet client version 2.6.12
>>>>> Dec 2 14:12:04 axxats003 puppet-agent[16965]: Could not retrieve catalog from remote server: execution expired
>>>>> Dec 2 14:12:04 axxats003 puppet-agent[16965]: Using cached catalog
>>>>>
>>>>> (every 30 minutes puppet agent says the same thing until I restart the
>>>>> puppet master)
>>>>>
>>>>> Dec 2 18:06:09 axxats003 puppet-master[25783]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:10 axxats003 puppet-master[25802]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:11 axxats003 puppet-master[25831]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:12 axxats003 puppet-master[25864]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:13 axxats003 puppet-master[25897]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:14 axxats003 puppet-master[25922]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:15 axxats003 puppet-master[25947]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:16 axxats003 puppet-master[25972]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:17 axxats003 puppet-master[25997]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:18 axxats003 puppet-master[26019]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:19 axxats003 puppet-master[26056]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:20 axxats003 puppet-master[26081]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:06:21 axxats003 puppet-master[26115]: Starting Puppet master version 2.6.12
>>>>> Dec 2 18:14:32 axxats003 puppet-master[26115]: Compiled catalog for axxatn018.sjc.company.com in environment production in 3.63 seconds
>>>>> Dec 2 18:14:37 axxats003 puppet-master[26115]: Compiled catalog for axxamb002.sjc.company.com in environment production in 1.47 seconds
>>>>> Dec 2 18:14:50 axxats003 puppet-master[26115]: Compiled catalog for axxasn001.sjc.company.com in environment production in 1.57 seconds
>>>>>
>>>>> There are no other messages in /var/log/messages -- the system was
>>>>> otherwise not busy. Apache error log only observed max clients get hit:
>>>>> [Fri Dec 02 08:42:43 2011] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
>>>>> [Fri Dec 02 12:23:46 2011] [error] server reached MaxClients setting, consider raising the MaxClients setting
>>>>> [Fri Dec 02 18:06:07 2011] [notice] caught SIGTERM, shutting down
>>>>> [Fri Dec 02 18:06:08 2011] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
>>>>> [Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) `puppetmaster.company.com' does NOT match server name!?
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Digest: generating secret for digest authentication ...
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Digest: done
>>>>> [Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) `puppetmaster.company.com' does NOT match server name!?
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Apache/2.2.3 (CentOS) configured -- resuming normal operations
>>>>>
>>>>> --
>>>>> Jo Rhett
>>>>> [email protected]
>>>>> (415) 999-1798
>>>>>
>>>>> --
>>>>> Jo Rhett
>>>>> Net Consonance : consonant endings by net philanthropy, open source and
>>>>> other randomness
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "Puppet Users" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to
>>>> [email protected].
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/puppet-users?hl=en.
>>>>
>>>> --
>>>> Nigel Kersten
>>>> Product Manager, Puppet Labs
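The fill-rate estimate from the 3:32 PM message (3 failing hosts checking in every 30 minutes against a pool of 20) can be written out as a small calculation. This is only a back-of-the-envelope sketch under stated assumptions: every failing check-in ties up exactly one Passenger worker, a hung worker is never recycled, and the function names are invented for illustration.

```python
import math


def runs_until_exhausted(pool_size: int, hung_per_run: int) -> int:
    """Agent run cycles needed before every pool slot is held by a hung worker.

    Assumes each failing check-in hangs exactly one worker for good.
    """
    return math.ceil(pool_size / hung_per_run)


def minutes_until_exhausted(pool_size: int, hung_per_run: int,
                            run_interval_min: int) -> int:
    """Minutes between the first hung cycle and the cycle that fills the pool."""
    return (runs_until_exhausted(pool_size, hung_per_run) - 1) * run_interval_min


if __name__ == "__main__":
    # Figures from the thread: PassengerMaxPoolSize 20, 3 hosts, 30-minute runs.
    runs = runs_until_exhausted(20, 3)
    minutes = minutes_until_exhausted(20, 3, 30)
    print(f"{runs} cycles, about {minutes / 60:.1f} hours after the first hang")
```

Under those assumptions the pool is gone on the seventh cycle, roughly three hours after the first hang, which is in line with the "just over 3 hours" estimate in the message.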
