Issue #16315 has been updated by Andreas Unterkircher.
This is what I see when tracing the process. It is hanging around waiting for its splay delay, I guess...
<pre>
22613 stat64("/etc/puppet/puppet.conf", {st_mode=S_IFREG|0644, st_size=312,
...}) = 0
22613 select(1, [], [], [], {15, 0}) = 0 (Timeout)
22613 gettimeofday({1347388152, 853471}, NULL) = 0
22613 gettimeofday({1347388152, 853578}, NULL) = 0
22613 gettimeofday({1347388152, 853637}, NULL) = 0
22613 stat64("/etc/puppet/puppet.conf", {st_mode=S_IFREG|0644, st_size=312,
...}) = 0
22613 select(1, [], [], [], {15, 0}) = 0 (Timeout)
22613 gettimeofday({1347388167, 865075}, NULL) = 0
22613 gettimeofday({1347388167, 865184}, NULL) = 0
22613 gettimeofday({1347388167, 865244}, NULL) = 0
22613 stat64("/etc/puppet/puppet.conf", {st_mode=S_IFREG|0644, st_size=312,
...}) = 0
</pre>
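The stat64/select pattern above can be sketched as follows. This is a hypothetical reconstruction, not Puppet's actual source; the function name and the scaled-down timings are mine. The idea is a loop that sleeps in fixed slices (15 s would match Puppet's default filetimeout) and re-stats puppet.conf between slices until the splay delay has elapsed:

```ruby
require 'tempfile'

# Hypothetical sketch of the wait loop suggested by the trace: each
# IO.select shows up in strace as select(1, [], [], [], {15, 0}) = 0
# (Timeout), and each File.stat as stat64("/etc/puppet/puppet.conf", ...).
def splay_wait(delay, slice, conf_path)
  deadline = Time.now + delay
  while Time.now < deadline
    IO.select(nil, nil, nil, [slice, deadline - Time.now].min)
    File.stat(conf_path) # config-change check between sleep slices
  end
end

conf = Tempfile.new('puppet.conf')  # stand-in for /etc/puppet/puppet.conf
splay_wait(0.3, 0.1, conf.path)     # scaled down; the real agent uses ~15 s slices
puts 'done'
```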
Once it is time to wake up, I see Ruby libraries being loaded, Facter running, and so on. The interesting part begins once it starts connecting to the master:
<pre>
22613 recvfrom(4, "r\236\201\200\0\1\0\1\0\0\0\0\6puppet\tmm-karton\3co"...,
1024, 0, {sa_family=AF_INET, sin_port=htons
(53), sin_addr=inet_addr("10.5.134.6")}, [16]) = 54
22613 close(4) = 0
22613 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 4
22613 fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
22613 fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
22613 connect(4, {sa_family=AF_INET, sin_port=htons(8140),
sin_addr=inet_addr("10.1.128.122")}, 16) = -1 EINPROGRESS (Op
eration now in progress)
22613 clock_gettime(CLOCK_MONOTONIC, {1910836, 405060194}) = 0
22613 select(6, [], [4], [4], {119, 993370} <unfinished ...>
1747 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
1747 clock_gettime(CLOCK_REALTIME, {1347388245, 701102872}) = 0
1747 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
1747 clock_gettime(CLOCK_REALTIME, {1347388245, 701163725}) = 0
1747 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1493043, {0, 9939147}) = -1
ETIMEDOUT (Connection timed out)
1747 clock_gettime(CLOCK_REALTIME, {1347388245, 711220912}) = 0
...
1747 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1493049, {0, 9939045} <unfinished
...>
22613 <... select resumed> ) = 1 (out [4], left {119, 950578})
22613 connect(4, {sa_family=AF_INET, sin_port=htons(8140),
sin_addr=inet_addr("10.1.128.122")}, 16) = 0
22613 fcntl64(4, F_SETFL, O_RDWR) = 0
</pre>
There is now a successfully opened TCP connection to our master at 10.1.128.122:8140.
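The trace shows the classic non-blocking connect pattern: connect() returns EINPROGRESS, select() waits for writability, and a second connect() confirms success. A minimal Ruby sketch of that pattern (using a throwaway local server as a stand-in, not the actual master):

```ruby
require 'socket'

srv  = TCPServer.new('127.0.0.1', 0) # stand-in for the master; real port was 8140
port = srv.addr[1]

sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
addr = Socket.sockaddr_in(port, '127.0.0.1')
begin
  sock.connect_nonblock(addr)        # strace: connect(...) = -1 EINPROGRESS
rescue Errno::EINPROGRESS
  IO.select(nil, [sock], nil, 120)   # strace: select(6, [], [4], [4], {119, ...})
  begin
    sock.connect_nonblock(addr)      # second connect() confirms the handshake
  rescue Errno::EISCONN
    # already connected -- this is the success path
  end
end
puts 'connected'
sock.close
srv.close
```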
Then the agent's parent process 22613 seems to create another thread (2016):
<pre>
22613 read(4,
"ibz\3\264@\2253\360\16\355\236\25,\3340+\336+\27\353Dy\10\211Z\n\340\374\376\"<"...,
48) = 48
22613 write(4,
"\27\3\1\1\0\33\34s\347\326\\\v.\231\334\321}\310\305=rU\"\7\6\t\266Bq\224g\275"...,
261) = 261
22613 write(4,
"\27\3\1\v\200\325\303\237\357\367E\357~\370\273uZ\\\317a\353w\5\351k\257;\\\25\263\36\320"...,
2949) = 2949
22613 rt_sigaction(SIGVTALRM, {0xb770f1b0, [], 0}, {0xb770f1b0, [], 0}, 8) = 0
22613 clone(child_stack=0xb6a98494,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0xb6a98bd8, {entry_number:6, base_addr:0xb6a98b70,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1,
seg_not_present:0, useable:1}, child_tidptr=0xb6a98bd8) = 2016
2016 set_robust_list(0xb6a98be0, 0xc <unfinished ...>
22613 futex(0xb77b43a4, FUTEX_WAIT_PRIVATE, 2411, NULL <unfinished ...>
</pre>
and that one immediately starts reporting ETIMEDOUT (which strace renders as "Connection timed out", although these are timed futex waits, not TCP timeouts):
<pre>
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 949234569}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 949293913}) = 0
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1493085, {0, 9940656}) = -1
ETIMEDOUT (Connection timed out)
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 959352438}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 959411950}) = 0
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1493087, {0, 9940488}) = -1
ETIMEDOUT (Connection timed out)
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 969472021}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 969535163}) = 0
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1493089, {0, 9936858}) = -1
ETIMEDOUT (Connection timed out)
2016 clock_gettime(CLOCK_REALTIME, {1347388265, 979593697}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
...
</pre>
until at some point 22613 has had enough, kills the thread, and logs the execution-expired message:
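For what it's worth, "execution expired" is the default message of Ruby's Timeout::Error, and the clone()/futex pattern above is consistent with Timeout.timeout spawning a watchdog thread that raises in the caller once the limit is up. A minimal sketch (the tiny limit and the stalled sleep are mine, standing in for a catalog request that never completes):

```ruby
require 'timeout'

msg = nil
begin
  Timeout.timeout(0.1) { sleep 1 }   # stand-in for a stalled read from the master
rescue Timeout::Error => e
  msg = e.message
  puts msg                           # prints "execution expired"
end
```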
<pre>
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1516743, {0, 9943533}) = -1
ETIMEDOUT (Connection timed out)
2016 clock_gettime(CLOCK_REALTIME, {1347388385, 865745862}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
2016 clock_gettime(CLOCK_REALTIME, {1347388385, 865803424}) = 0
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1516745, {0, 9942438}) = -1
ETIMEDOUT (Connection timed out)
2016 clock_gettime(CLOCK_REALTIME, {1347388385, 875863707}) = 0
2016 futex(0xb77b4370, FUTEX_WAKE_PRIVATE, 1) = 0
2016 clock_gettime(CLOCK_REALTIME, {1347388385, 875920918}) = 0
2016 futex(0xb77b4344, FUTEX_WAIT_PRIVATE, 1516747, {0, 9942789} <unfinished
...>
22613 <... select resumed> ) = 0 (Timeout)
22613 clock_gettime(CLOCK_MONOTONIC, {1910976, 584048015}) = 0
22613 select(5, [4], [], [], {0, 0}) = 0 (Timeout)
22613 time(NULL) = 1347388385
22613 select(5, [4], [], [], {0, 0}) = 0 (Timeout)
22613 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
22613 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
22613 futex(0xb77b4344, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xb77b4340, {FUTEX_OP_SET,
0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
2016 <... futex resumed> ) = 0
</pre>
But at the same time I can successfully trigger a manual run with
<pre>
puppetd --test --masterport 8140 --server 10.1.128.122
</pre>
I'm rather clueless about what is going on here...
----------------------------------------
Bug #16315: agents stop by execution-expired after logrotate having reloading
them
https://projects.puppetlabs.com/issues/16315#change-70883
Author: Andreas Unterkircher
Status: Needs More Information
Priority: Normal
Assignee: Andreas Unterkircher
Category:
Target version:
Affected Puppet version:
Keywords:
Branch:
We have about 350 agents active in our network, and for some weeks now all of them have been suffering from the same problem after being reloaded by logrotate on Sunday morning.
Agents and master are running version 2.7.18 on Debian Squeeze (ruby 1.8.7).
Our agent config is
<pre>
[main]
logdir=/var/log/puppet
vardir=/var/lib/puppet
rundir=/var/run/puppet
[agent]
server=puppet.example.com
storeconfigs=false
listen=false
report=false
splay=true
runinterval = 3600
syslogfacility = local3
masterport=8140
</pre>
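A note on splay, since it explains the long wait seen in the strace: as I understand the documented behaviour, splay=true makes the agent sleep a random delay of up to splaylimit seconds before a run, and splaylimit defaults to runinterval (3600 s here). A minimal sketch of that relation (variable names are mine, not Puppet's):

```ruby
# Hypothetical illustration of the splay delay, assuming splaylimit
# defaults to runinterval as documented.
runinterval = 3600
splaylimit  = runinterval        # documented default when splaylimit is unset
delay = rand(splaylimit)         # 0 .. 3599 seconds of extra sleep before a run
puts delay.between?(0, splaylimit - 1)   # prints true
```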
As so many of our agents were affected, I first thought the problem was related to our master, so I reconfigured it from WEBrick (which had been running very well up to that point) to Passenger. But the issue remained the same.
What happens is that on Sunday at 6:25 logrotate is started by cron. It then rotates /var/log/puppet/http.log according to /etc/logrotate.d/puppet:
<pre>
/var/log/puppet/*log {
    missingok
    create 0644 puppet puppet
    compress
    rotate 4
    postrotate
        [ -e /etc/init.d/puppetmaster ] && /etc/init.d/puppetmaster reload >/dev/null 2>&1 || true
        [ -e /etc/init.d/puppet ] && /etc/init.d/puppet reload > /dev/null 2>&1 || true
    endscript
}
</pre>
After being reloaded, all agents successfully execute one (and only one) more run. After that, all of them start logging execution-expired messages and stop working. In syslog it looks like this:
<pre>
>>>> after being reloaded at 6:25, this is the first run
Sep 9 06:38:47 server puppet-agent[3316]:
(/Stage[main]/Nagios-common/Modulefile[/var/run/nagios.tick]/File[/var/run/nagios.tick]/content)
content changed '{md5}611d0e76b249daa2ba57118ccd63bc4b' to
'{md5}0e11de68be51c68e1d3d2f947479b756'
Sep 9 06:39:00 server puppet-agent[3316]: Finished catalog run in 15.11 seconds
>>>> starting from here, only execution-expired's
Sep 9 07:27:18 server puppet-agent[3316]: Could not retrieve catalog from
remote server: execution expired
Sep 9 07:27:19 server puppet-agent[3316]: Using cached catalog
Sep 9 08:27:18 server puppet-agent[3316]: Could not retrieve catalog from
remote server: execution expired
Sep 9 08:27:19 server puppet-agent[3316]: Using cached catalog
Sep 9 08:27:19 server puppet-agent[3316]: Could not retrieve catalog; skipping
run
Sep 9 09:27:22 server puppet-agent[3316]: Could not retrieve catalog from
remote server: execution expired
Sep 9 09:27:23 server puppet-agent[3316]: Using cached catalog
Sep 9 09:27:23 server puppet-agent[3316]: Could not retrieve catalog; skipping
run
Sep 9 10:27:28 server puppet-agent[3316]: Could not retrieve catalog from
remote server: execution expired
Sep 9 10:27:28 server puppet-agent[3316]: Using cached catalog
</pre>
It is the same picture on all of our other nodes. I can trigger a successful manual run on those machines at any time, so I don't think the problem is caused by the master. I have already tried changing the "reload" in logrotate to "restart", but the issue remains the same.
Even keeping the agent running in the foreground with debug & trace enabled doesn't give me a hint: it just starts logging those execution-expired messages without even attempting to connect to the master (verified by tcpdump).
The issue is also hard to reproduce. Sending a manual SIGHUP to a running agent doesn't trigger this behaviour; it seems the problem only appears when an agent has been running for some time (days) and is then reloaded/restarted by logrotate.
Does anyone have a hint on how I can further track down this issue?