Issue #11360 has been updated by Jo Rhett.

Completely different.  If you read 10418 you'll see that the puppet agents never 
ran at all, no matter how many times you restarted them.  In this situation we 
had 50-something systems leave puppetdlock files in the state directory at 
exactly the same time.  Every one of the puppetdlocks was dated 2:45pm PST.

No commonality of kernel version.  Restarting the puppet agent works perfectly 
fine, and they are all back up again and running every 30 minutes.  Thus, not 
the same problem.

I haven't had a chance to correlate times, but I imagine this is exactly when I 
restarted the puppet server, which had become unusable after running out of swap 
space.  That's a separate issue I'll be writing in about -- how an LDAP failure 
caused ruby processes to zombie and fill up swap.  But this ticket is about how 
50-something systems were left in a state that needed manual intervention to get 
them going again.

Yes, the puppet server not responding is traumatic.  No, it shouldn't leave the 
client in a state where it can't clean up after itself and get on with business.
----------------------------------------
Bug #11360: puppet client hangs after period of being unable to contact server
https://projects.puppetlabs.com/issues/11360

Author: Jo Rhett
Status: Re-opened
Priority: Normal
Assignee: 
Category: agent
Target version: 
Affected Puppet version: 2.6.12
Keywords: 
Branch: 


We had some serious memory/swap issues with the puppet master today.  I spent a 
few hours getting that worked out, and upgrading to passenger 3.0.11.  After 
clearing up the issues we found that some 50 systems weren't up to date.  
Logging into these systems I found a puppetdlock file which was about 4.5 hours 
old and a running puppetd which was looping doing nothing.

<pre>
[03:30 root@ald002 ~]$ ls -la /var/lib/puppet/state
total 164
drwxr-xr-t  3 root   root     4096 Dec 12 22:45 .
drwxr-xr-x 10 puppet puppet   4096 Oct 24 18:28 ..
drwxr-xr-x  2 root   root     4096 Sep 23 05:03 graphs
-rw-rw----  1 root   root     1448 Dec 12 22:15 last_run_report.yaml
-rw-rw----  1 root   root       38 Dec 12 22:15 last_run_summary.yaml
-rw-r--r--  1 root   root        4 Dec 12 22:45 puppetdlock
-rw-rw----  1 root   root   109285 Dec 12 22:15 state.yaml
</pre>
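
A lock left behind like this can be spotted by its age. The following is only a minimal sketch, not anything Puppet ships: it assumes the lock path from the listing above, and it treats a lock older than two 30-minute run intervals as stale. The threshold and the function names are made up for illustration.

```python
import os
import time

# Assumptions (not from the report): the lock path matches the listing above,
# and a lock older than two 30-minute run intervals counts as stale.
LOCK_PATH = "/var/lib/puppet/state/puppetdlock"
MAX_AGE_SECONDS = 2 * 30 * 60


def lock_age_seconds(path, now=None):
    """Return the age of the lock file in seconds, or None if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the lock file exists and is older than max_age seconds."""
    age = lock_age_seconds(path, now)
    return age is not None and age > max_age


if __name__ == "__main__":
    age = lock_age_seconds(LOCK_PATH)
    if age is None:
        print("no lock file")
    elif age > MAX_AGE_SECONDS:
        print("stale lock: %.0f minutes old" % (age / 60))
    else:
        print("lock is recent: %.0f minutes old" % (age / 60))
```

This only reports on the lock; it deliberately does not remove it or restart the agent, since the agent process was still running on the affected hosts.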

Here's an example log:
<pre>
Dec 12 21:12:27 ald002 puppet-agent[6945]: Could not retrieve catalog from remote server: Connection refused - connect(2)
Dec 12 21:12:27 ald002 puppet-agent[6945]: Using cached catalog
Dec 12 21:12:27 ald002 puppet-agent[6945]: Could not retrieve catalog; skipping run
Dec 12 21:12:27 ald002 puppet-agent[6945]: Could not send report: Connection refused - connect(2)
Dec 12 21:42:30 ald002 puppet-agent[6945]: Could not retrieve catalog from remote server: Connection refused - connect(2)
Dec 12 21:42:30 ald002 puppet-agent[6945]: Using cached catalog
Dec 12 21:42:30 ald002 puppet-agent[6945]: Could not retrieve catalog; skipping run
Dec 12 21:42:30 ald002 puppet-agent[6945]: Could not send report: Connection refused - connect(2)
Dec 12 22:15:57 ald002 puppet-agent[6945]: Could not run Puppet configuration client: execution expired
</pre>

<pre>
[03:32 root@ald002 ~]$ strace -p 6945
Process 6945 attached - interrupt to quit
select(11, [7 9], [], [], {1, 107000})  = 0 (Timeout)
select(11, [7 9], [], [], {0, 0})       = 0 (Timeout)
select(11, [9], [], [], {0, 0})         = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(11, [7 9], [], [], {1, 999999})  = 0 (Timeout)
select(11, [7 9], [], [], {0, 0})       = 0 (Timeout)
select(11, [9], [], [], {0, 0})         = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(11, [7 9], [], [], {1, 999999})  = 0 (Timeout)
select(11, [7 9], [], [], {0, 602})     = 0 (Timeout)
select(11, [7 9], [], [], {0, 0})       = 0 (Timeout)
select(11, [9], [], [], {0, 0})         = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(11, [7 9], [], [], {1, 999997})  = 0 (Timeout)
select(11, [7 9], [], [], {0, 566})     = 0 (Timeout)
select(11, [7 9], [], [], {0, 0})       = 0 (Timeout)
select(11, [9], [], [], {0, 0})         = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(11, [7 9], [], [], {1, 999998})  = 0 (Timeout)
select(11, [7 9], [], [], {0, 0})       = 0 (Timeout)
select(11, [9], [], [], {0, 0})         = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(11, [7 9], [], [], {1, 999999} <unfinished ...>
Process 6945 detached
</pre>

