Issue #10418 has been updated by Jo Rhett.

So we have found some consistency in the affected systems: certain classes of 
hosts are affected more often than others.  Very oddly, the most-affected class 
of servers is one where the fewest Puppet classes are applied -- and every 
class applied to them is also applied to hundreds of other hosts in our 
environment!

So I logged into a host which hadn't checked in for a while, and found that it 
seems to be stuck within a single loop, whereas the puppet processes on systems 
running normally have more variance in their strace output.  Here's a quick 
example:

Healthy system:

select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 698})      = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {1, 999999})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
select(6, [4], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [4 7], [], [], {0, 465482})   = 0 (Timeout)
select(9, [4 7], [], [], {0, 0})        = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(9, [7], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0

So it is definitely looping through a couple of selects, but the calls vary a 
bit and it branches into different functions at times -- for example, checking 
/etc/puppet/puppet.conf.

On the systems which are hung, the process stays forever within a single loop 
without variance:

select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 88})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 85})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999999})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 113})      = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
select(9, [7 8], [], [], {1, 999998})   = 0 (Timeout)
select(9, [7 8], [], [], {0, 56})       = 0 (Timeout)
select(9, [7 8], [], [], {0, 0})        = 0 (Timeout)
select(9, [8], [], [], {0, 0})          = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0

These two systems are both running puppet 2.6.12 on CentOS 5.5.
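For reference, the idle pattern in these traces maps directly onto Ruby's IO.select: each select(..., {1, 999999}) line is a poll of a couple of descriptors that returns with nothing readable. Here is a minimal Ruby sketch of that syscall pattern (illustrative only -- this is not Puppet's actual event loop, and the timeout is shortened for the demo):

```ruby
# Idle IO.select loop: with nothing ever written to the pipe, every
# call times out and returns nil, which is what produces the
# "= 0 (Timeout)" lines in the strace output above.
r, w = IO.pipe
5.times do
  ready = IO.select([r], nil, nil, 0.05)   # short timeout for the demo
  raise "unexpected readiness" unless ready.nil?
end
puts "idle loop: every select timed out"
```

A hung agent differs from a healthy one only in that it never leaves this loop to do other work.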
----------------------------------------
Bug #10418: "Caught TERM; calling stop" with state/puppetdlock left in place
https://projects.puppetlabs.com/issues/10418

Author: Jo Rhett
Status: Unreviewed
Priority: Normal
Assignee: 
Category: agent
Target version: 
Affected Puppet version: 2.6.12
Keywords: 
Branch: 


Mon Oct 31 23:03:31 +0000 2011 Puppet (notice): Caught TERM; calling stop

Ever since the 2.6.12 upgrade I've been seeing these reports reach us -- 
roughly a hundred out of half a thousand machines.  Most of the time we find 
that $vardir/state/puppetdlock is still in place and blocking further puppet 
runs, which requires manual resolution.

I wrote a quick cron script to look for puppetdlock files older than one hour, 
remove them, and mail me a report, and I've received several dozen reports in 
the last few hours.  Something is clearly broken in 2.6.12; we are downgrading 
our systems to 2.6.11.
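A minimal sketch of that kind of stale-lock check (the path below assumes $vardir is /var/lib/puppet, which varies by install; the real cron job would also pipe its report to mail):

```ruby
# Remove a puppetdlock file older than one hour. The default path is an
# assumption for $vardir (/var/lib/puppet); override via PUPPET_LOCK.
lock = ENV['PUPPET_LOCK'] || '/var/lib/puppet/state/puppetdlock'
if File.file?(lock) && Time.now - File.mtime(lock) > 3600
  puts "removing stale puppetdlock: #{lock}"   # the real job mails this
  File.delete(lock)
end
```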

No -- I have no other information, other than that it crosses all of our 
machine types and we have had no significant changes to our modules in this 
time period.  Many of the machines which have failed have had zero module or 
manifest changes which would apply to them.  I cannot get this to reproduce on 
the command line.


-- 
You have received this notification because you have either subscribed to it, 
or are involved in it.
To change your notification preferences, please click here: 
http://projects.puppetlabs.com/my/account

-- 
You received this message because you are subscribed to the Google Groups 
"Puppet Bugs" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/puppet-bugs?hl=en.

Reply via email to