I tried removing the cert from the puppet master for one of the three systems 
that can't get catalogs, and removing the entire /var/lib/puppet directory on 
the client, but got the exact same response. After getting a new cert and 
signing it, the client accepted the cert and then hung waiting for the catalog. 
The process on the puppetmaster also hung.

Is there anything else I could test, check, or clear?  Running the client with 
--debug --trace simply shows me a timeout, with no extra detail.

On Dec 2, 2011, at 3:32 PM, Jo Rhett wrote:
> I am also now pretty certain that this issue (ticket #11140) is tied directly 
> to 3 systems (in ticket #11143) which can't get catalogs. I believe their 
> attempts to get a catalog produce a hung server. 3 servers every 30 minutes 
> means that in just over 3 hours I have 20 hung puppetmasters, and the queue 
> goes out of control.
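
The estimate above can be checked with a few lines of Python. This is only a 
rough sketch: it assumes each failing agent run hangs exactly one worker and 
that a hung worker is never recycled. The function name is made up for 
illustration.

```python
import math

def minutes_until_pool_exhausted(pool_size, hung_per_interval, interval_minutes):
    """Minutes until every worker is hung.

    Assumes each failing agent run permanently hangs one worker
    (an assumption for this sketch, not a confirmed diagnosis).
    """
    intervals = math.ceil(pool_size / hung_per_interval)
    return intervals * interval_minutes

if __name__ == "__main__":
    # 20 workers, 3 failing agents, one run every 30 minutes
    print(minutes_until_pool_exhausted(20, 3, 30))
```

With the numbers from this thread (20, 3, 30) this gives 210 minutes, i.e. 3.5 
hours, which is consistent with "just over 3 hours".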
> 
> I would deeply appreciate some information on how to diagnose the catalog 
> failures and related puppetmaster hangs.
> 
> On Dec 2, 2011, at 3:09 PM, Jo Rhett wrote:
>> Hm, you know I don't think that it's a sudden lock of all 20 passenger 
>> clients.  I think it's a slow lockup of various puppet sessions until all 20 
>> are locked.  Here's an example: every one of the "active" sessions below 
>> with an uptime longer than 30 minutes has had the same "processed" number 
>> for more than 30 minutes at this time.  So in theory, they've been 
>> processing the same session for more than 30 minutes.  Somehow, I don't 
>> think so.  I think those sessions are locked up.  And what is happening is 
>> that eventually all 20 processes are hung and we are dead in the water.
>> 
>> Fri Dec  2 23:05:59 UTC 2011
>> ----------- General information -----------
>> max      = 20
>> count    = 18
>> active   = 12
>> inactive = 6
>> Waiting on global queue: 0
>> 
>> ----------- Domains -----------
>> /etc/puppet/rack: 
>>   PID: 21021   Sessions: 0    Processed: 362     Uptime: 5m 37s
>>   PID: 21005   Sessions: 0    Processed: 537     Uptime: 5m 38s
>>   PID: 21555   Sessions: 0    Processed: 69      Uptime: 30s
>>   PID: 21571   Sessions: 0    Processed: 62      Uptime: 29s
>>   PID: 20989   Sessions: 0    Processed: 209     Uptime: 5m 39s
>>   PID: 20968   Sessions: 0    Processed: 157     Uptime: 5m 41s
>>   PID: 9221    Sessions: 1    Processed: 903     Uptime: 2h 5m 55s
>>   PID: 9340    Sessions: 1    Processed: 764     Uptime: 2h 4m 58s
>>   PID: 10379   Sessions: 1    Processed: 568     Uptime: 1h 57m 37s
>>   PID: 11847   Sessions: 1    Processed: 712     Uptime: 1h 41m 13s
>>   PID: 11686   Sessions: 1    Processed: 314     Uptime: 1h 41m 19s
>>   PID: 10845   Sessions: 1    Processed: 511     Uptime: 1h 48m 52s
>>   PID: 11650   Sessions: 1    Processed: 747     Uptime: 1h 41m 21s
>>   PID: 14967   Sessions: 1    Processed: 84      Uptime: 1h 8m 28s
>>   PID: 17605   Sessions: 1    Processed: 497     Uptime: 44m 41s
>>   PID: 20342   Sessions: 1    Processed: 0       Uptime: 13m 14s
>>   PID: 20358   Sessions: 1    Processed: 54      Uptime: 13m 13s
>>   PID: 18098   Sessions: 1    Processed: 854     Uptime: 35m 46s
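
One way to confirm which workers are stuck, rather than eyeballing it, is to 
compare two passenger-status snapshots taken some minutes apart. The sketch 
below is hypothetical (the function names are mine) and assumes the process 
lines look like the ones in this thread. It flags every PID that holds a 
session in both snapshots while its "Processed" count has not moved.

```python
import re

# Matches one process line of passenger-status output as shown in this thread.
LINE_RE = re.compile(
    r"PID:\s*(\d+)\s+Sessions:\s*(\d+)\s+Processed:\s*(\d+)\s+Uptime:\s*(.+?)\s*$"
)

def parse_status(text):
    """Return {pid: (sessions, processed)} for every process line in text."""
    procs = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            pid, sessions, processed, _uptime = match.groups()
            procs[int(pid)] = (int(sessions), int(processed))
    return procs

def stuck_pids(earlier, later):
    """PIDs with an open session in both snapshots and an unchanged counter."""
    before, after = parse_status(earlier), parse_status(later)
    return sorted(
        pid
        for pid, (sessions, processed) in after.items()
        if pid in before
        and sessions > 0
        and before[pid][0] > 0
        and before[pid][1] == processed
    )
```

Usage would be something like saving the output of passenger-status to two 
files 30 minutes apart and passing their contents to stuck_pids(). A worker 
that is merely slow would also show up, so treat the result as a list of 
suspects, not proof.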
>> 
>> On Dec 2, 2011, at 2:22 PM, Jo Rhett wrote:
>> 
>>> On Dec 2, 2011, at 1:30 PM, Nigel Kersten wrote:
>>>> On Fri, Dec 2, 2011 at 1:03 PM, Jo Rhett <[email protected]> wrote:
>>>> Okay, this has happened again.  Puppet master stopped logging catalog 
>>>> compiles, every server stopped returning results, and the global queue 
>>>> went through the roof in about 9 minutes.  It appears puppet master is 
>>>> stopping dead in its tracks without logging any errors.
>>>> 
>>>> A really quick test would be to start a webrick puppetmaster on an 
>>>> alternate port with the same configuration file in debug mode and then 
>>>> puppet against it to see if there's a problem at that level,
>>>> 
>>>> (on master)
>>>> puppet master --no-daemonize --verbose --debug --masterport 9140 (for 
>>>> example)
>>>> 
>>>> (on an agent)
>>>> puppet agent --test --masterport 9140
>>> 
>>> This works perfectly fine.
>>> 
>>>> If that doesn't show anything, let us know whether you're running Apache 
>>>> prefork or worker, and your relevant pool regulation settings like:
>>>> 
>>>> StartServers
>>>> MinSpareServers
>>>> MaxSpareServers
>>>> ServerLimit
>>>> MaxClients
>>>> MaxRequestsPerChild
>>> 
>>> prefork, with the following settings:
>>> 
>>> StartServers       8
>>> MinSpareServers    5
>>> MaxSpareServers   20
>>> ServerLimit      256
>>> MaxClients       256
>>> MaxRequestsPerChild  4000
>>> 
>>>> # passenger-status
>>>> ----------- General information -----------
>>>> max      = 20
>>>> count    = 20
>>>> active   = 20
>>>> inactive = 0
>>>> Waiting on global queue: 209
>>>> 
>>>> ----------- Domains -----------
>>>> /etc/puppet/rack: 
>>>>   PID: 25783   Sessions: 1    Processed: 329     Uptime: 2h 52m 7s
>>>>   PID: 25831   Sessions: 1    Processed: 4       Uptime: 2h 52m 5s
>>>>   PID: 28517   Sessions: 1    Processed: 6       Uptime: 2h 22m 0s
>>>>   PID: 25802   Sessions: 1    Processed: 714     Uptime: 2h 52m 6s
>>>>   PID: 30905   Sessions: 1    Processed: 13      Uptime: 1h 50m 27s
>>>>   PID: 25864   Sessions: 1    Processed: 709     Uptime: 2h 52m 4s
>>>>   PID: 31028   Sessions: 1    Processed: 347     Uptime: 1h 50m 21s
>>>>   PID: 28944   Sessions: 1    Processed: 377     Uptime: 2h 21m 50s
>>>>   PID: 31090   Sessions: 1    Processed: 266     Uptime: 1h 50m 18s
>>>>   PID: 577     Sessions: 1    Processed: 400     Uptime: 1h 27m 27s
>>>>   PID: 418     Sessions: 1    Processed: 647     Uptime: 1h 28m 2s
>>>>   PID: 1247    Sessions: 1    Processed: 133     Uptime: 1h 19m 3s
>>>>   PID: 1474    Sessions: 1    Processed: 52      Uptime: 1h 18m 9s
>>>>   PID: 594     Sessions: 1    Processed: 378     Uptime: 1h 27m 26s
>>>>   PID: 4706    Sessions: 1    Processed: 414     Uptime: 48m 5s
>>>>   PID: 4775    Sessions: 1    Processed: 218     Uptime: 47m 28s
>>>>   PID: 4854    Sessions: 1    Processed: 584     Uptime: 47m 23s
>>>>   PID: 7774    Sessions: 1    Processed: 165     Uptime: 14m 27s
>>>>   PID: 7902    Sessions: 1    Processed: 44      Uptime: 13m 44s
>>>>   PID: 8149    Sessions: 1    Processed: 541     Uptime: 11m 21s
>>>> 
>>>> 
>>>> On Dec 2, 2011, at 10:58 AM, Jo Rhett wrote:
>>>>> I came in this morning to find all the servers all locked up solid:
>>>>> 
>>>>> # passenger-status
>>>>> ----------- General information -----------
>>>>> max      = 20
>>>>> count    = 20
>>>>> active   = 20
>>>>> inactive = 0
>>>>> Waiting on global queue: 236
>>>>> 
>>>>> ----------- Domains -----------
>>>>> /etc/puppet/rack: 
>>>>>  PID: 2720    Sessions: 1    Processed: 939     Uptime: 9h 22m 18s
>>>>>  PID: 1615    Sessions: 1    Processed: 947     Uptime: 9h 23m 14s
>>>>>  PID: 1596    Sessions: 1    Processed: 607     Uptime: 9h 23m 15s
>>>>>  PID: 1722    Sessions: 1    Processed: 953     Uptime: 9h 23m 9s
>>>>>  PID: 2218    Sessions: 1    Processed: 378     Uptime: 9h 22m 43s
>>>>>  PID: 4286    Sessions: 1    Processed: 178     Uptime: 8h 50m 58s
>>>>>  PID: 5749    Sessions: 1    Processed: 708     Uptime: 8h 20m 20s
>>>>>  PID: 4253    Sessions: 1    Processed: 820     Uptime: 8h 51m 1s
>>>>>  PID: 5624    Sessions: 1    Processed: 126     Uptime: 8h 20m 24s
>>>>>  PID: 7328    Sessions: 1    Processed: 811     Uptime: 7h 49m 17s
>>>>>  PID: 7274    Sessions: 1    Processed: 984     Uptime: 7h 49m 20s
>>>>>  PID: 8761    Sessions: 1    Processed: 85      Uptime: 7h 18m 50s
>>>>>  PID: 9135    Sessions: 1    Processed: 907     Uptime: 7h 16m 27s
>>>>>  PID: 8777    Sessions: 1    Processed: 342     Uptime: 7h 18m 49s
>>>>>  PID: 10508   Sessions: 1    Processed: 51      Uptime: 6h 47m 6s
>>>>>  PID: 10853   Sessions: 1    Processed: 603     Uptime: 6h 43m 9s
>>>>>  PID: 10620   Sessions: 1    Processed: 939     Uptime: 6h 45m 52s
>>>>>  PID: 11438   Sessions: 1    Processed: 870     Uptime: 6h 30m 8s
>>>>>  PID: 12582   Sessions: 1    Processed: 448     Uptime: 6h 9m 59s
>>>>>  PID: 12670   Sessions: 1    Processed: 400     Uptime: 6h 8m 46s
>>>>> 
>>>>> For comparison, most of our server processes normally recycle within 20 
>>>>> minutes, as they hit the 1000-request limit really fast.
>>>>> 
>>>>> # you probably want to tune these settings
>>>>> PassengerHighPerformance on
>>>>> PassengerUseGlobalQueue on
>>>>> PassengerMaxPoolSize 20
>>>>> PassengerPoolIdleTime 1800
>>>>> PassengerMaxRequests 1000
>>>>> #PassengerStatThrottleRate 120
>>>>> RackAutoDetect Off
>>>>> RailsAutoDetect Off
>>>>> 
>>>>> There is nothing useful in the system logs.  They just stopped:
>>>>> 
>>>>> Dec  2 12:06:34 axxats003 puppet-master[12670]: Compiled catalog for 
>>>>> axxamx001.sjc.company.com in environment production in 1.76 seconds
>>>>> Dec  2 12:06:37 axxats003 puppet-master[12670]: Compiled catalog for 
>>>>> axxatn016.sjc.company.com in environment production in 1.64 seconds
>>>>> Dec  2 12:06:40 axxats003 puppet-master[12670]: Compiled catalog for 
>>>>> axaafc001.company.com in environment production in 1.70 seconds
>>>>> Dec  2 14:10:02 axxats003 puppet-agent[16965]: Reopening log files
>>>>> Dec  2 14:10:02 axxats003 puppet-agent[16965]: Starting Puppet client 
>>>>> version 2.6.12
>>>>> Dec  2 14:12:04 axxats003 puppet-agent[16965]: Could not retrieve catalog 
>>>>> from remote server: execution expired
>>>>> Dec  2 14:12:04 axxats003 puppet-agent[16965]: Using cached catalog
>>>>> 
>>>>> (every 30 minutes puppet agent says the same thing until I restart the 
>>>>> puppet master)
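
To pin down exactly when the master went quiet, the gap between consecutive 
"Compiled catalog" entries can be pulled out of the syslog file. This is a 
minimal sketch with a made-up function name. It assumes unwrapped syslog lines 
(one entry per line, unlike the mail-wrapped excerpt above), a 15-character 
timestamp at the start of each line, and that the year (2011 here) is supplied 
by the caller since syslog does not record it.

```python
from datetime import datetime

def compile_gaps(log_lines, year=2011, min_gap_seconds=1800):
    """Return (start, end) pairs of consecutive 'Compiled catalog' entries
    that are more than min_gap_seconds apart."""
    stamps = []
    for line in log_lines:
        if "Compiled catalog" not in line:
            continue
        # Syslog timestamp is the first 15 characters, e.g. "Dec  2 12:06:34".
        stamps.append(
            datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        )
    return [
        (a, b)
        for a, b in zip(stamps, stamps[1:])
        if (b - a).total_seconds() > min_gap_seconds
    ]
```

Feeding it the lines of /var/log/messages would list every window longer than 
30 minutes in which no catalog was compiled. The 30-minute threshold is just 
the agent run interval mentioned in this thread.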
>>>>> 
>>>>> Dec  2 18:06:09 axxats003 puppet-master[25783]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:10 axxats003 puppet-master[25802]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:11 axxats003 puppet-master[25831]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:12 axxats003 puppet-master[25864]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:13 axxats003 puppet-master[25897]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:14 axxats003 puppet-master[25922]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:15 axxats003 puppet-master[25947]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:16 axxats003 puppet-master[25972]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:17 axxats003 puppet-master[25997]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:18 axxats003 puppet-master[26019]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:19 axxats003 puppet-master[26056]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:20 axxats003 puppet-master[26081]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:06:21 axxats003 puppet-master[26115]: Starting Puppet master 
>>>>> version 2.6.12
>>>>> Dec  2 18:14:32 axxats003 puppet-master[26115]: Compiled catalog for 
>>>>> axxatn018.sjc.company.com in environment production in 3.63 seconds
>>>>> Dec  2 18:14:37 axxats003 puppet-master[26115]: Compiled catalog for 
>>>>> axxamb002.sjc.company.com in environment production in 1.47 seconds
>>>>> Dec  2 18:14:50 axxats003 puppet-master[26115]: Compiled catalog for 
>>>>> axxasn001.sjc.company.com in environment production in 1.57 seconds
>>>>> 
>>>>> There are no other messages in /var/log/messages -- the system was 
>>>>> otherwise not busy.  The Apache error log only shows MaxClients being hit:
>>>>> [Fri Dec 02 08:42:43 2011] [notice] Apache/2.2.3 (CentOS) configured -- 
>>>>> resuming normal operations
>>>>> [Fri Dec 02 12:23:46 2011] [error] server reached MaxClients setting, 
>>>>> consider raising the MaxClients setting
>>>>> [Fri Dec 02 18:06:07 2011] [notice] caught SIGTERM, shutting down
>>>>> [Fri Dec 02 18:06:08 2011] [notice] suEXEC mechanism enabled (wrapper: 
>>>>> /usr/sbin/suexec)
>>>>> [Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) 
>>>>> `puppetmaster.company.com' does NOT match server name!?
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Digest: generating secret for digest 
>>>>> authentication ...
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Digest: done
>>>>> [Fri Dec 02 18:06:08 2011] [warn] RSA server certificate CommonName (CN) 
>>>>> `puppetmaster.company.com' does NOT match server name!?
>>>>> [Fri Dec 02 18:06:08 2011] [notice] Apache/2.2.3 (CentOS) configured -- 
>>>>> resuming normal operations
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jo Rhett
>>>>> [email protected]
>>>>> (415) 999-1798
>>>>> 
>>>>> -- 
>>>>> Jo Rhett
>>>>> Net Consonance : consonant endings by net philanthropy, open source and 
>>>>> other randomness
>>>>> 
>>>> 
>>>> -- 
>>>> You received this message because you are subscribed to the Google Groups 
>>>> "Puppet Users" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to 
>>>> [email protected].
>>>> For more options, visit this group at 
>>>> http://groups.google.com/group/puppet-users?hl=en.
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Nigel Kersten
>>>> Product Manager, Puppet Labs
>>> 
> 

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other 
randomness
