Issue #2211 has been updated by Marcin Owsiany.

I first noticed the problem when working with puppet on a train,
without any internet connection. In such situation, two problems appear:

1) Mildly annoying, frequent 2-second pauses followed by:

<pre>
      Timed out seeking value for ipaddress
</pre>

    I think this is because when I have no network connection, there is
    no nameserver in /etc/resolv.conf, and despite my hostname being in
    /etc/hosts, some resolver library does stupid things (I think it
    tries to contact a purely imaginary nameserver running on localhost,
    but I haven't thorougly investigated that).  This is a more general
    system issue, outside of puppet, as "host `hostname`" also hangs for
    5 seconds before returning:

<pre>
      ;; connection timed out; no servers could be reached
</pre>

2) More seriously, most - if not all - resoruces fail with error such as:

<pre>
      err: //beczulka/common/common_tweaks/common_bash_rc/File[/root/.bashrc]: 
Failed
      to retrieve current state of resource: No child processes
</pre>

    Adding --trace on the command line shows for example:

<pre>
        /usr/lib/ruby/1.8/puppet/util.rb:290:in `waitpid2'
        /usr/lib/ruby/1.8/puppet/util.rb:290:in `execute'
        /usr/lib/ruby/1.8/puppet/util/diff.rb:12:in `diff'
        /usr/lib/ruby/1.8/puppet/util/diff.rb:67:in `string_file_diff'
        /usr/lib/ruby/1.8/puppet/type/file/source.rb:163:in `insync?'
        /usr/lib/ruby/1.8/puppet/type.rb:965:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/util/posix.rb:156:in `find_all'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `each'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `find_all'
        /usr/lib/ruby/1.8/puppet/type.rb:963:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/type/file.rb:1123:in `propertychanges'
        /usr/lib/ruby/1.8/puppet/type.rb:833:in `evaluate'
        /usr/lib/ruby/1.8/puppet/transaction.rb:60:in `apply'
        /usr/lib/ruby/1.8/puppet/transaction.rb:239:in `eval_resource'
        /usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
        /usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
        /usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
        /usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
        /usr/lib/ruby/1.8/puppet/transaction.rb:238:in `eval_resource'
        /usr/lib/ruby/1.8/puppet/transaction.rb:310:in `evaluate'
        /usr/lib/ruby/1.8/puppet/util.rb:425:in `thinmark'
        /usr/lib/ruby/1.8/benchmark.rb:293:in `measure'
        /usr/lib/ruby/1.8/benchmark.rb:307:in `realtime'
        /usr/lib/ruby/1.8/puppet/util.rb:424:in `thinmark'
        /usr/lib/ruby/1.8/puppet/transaction.rb:309:in `evaluate'
        /usr/lib/ruby/1.8/puppet/transaction.rb:303:in `collect'
        /usr/lib/ruby/1.8/puppet/transaction.rb:303:in `evaluate'
        /usr/lib/ruby/1.8/puppet/node/catalog.rb:124:in `apply'
        /usr/bin/puppet:233
</pre>

Now, when I do have internet connection, things are working just fine, so there
must be some connection between (1) and (2). However ignoring it for a while, I
tried to strace puppet to see what happens around before the wait() call. This
was a bit hard. Stracing all syscalls takes a huge amount of time and produces
massive output. So in the end, after ignoring some syscalls with "-e
'!nanosleep,gettimeofday,sigprocmask'" I managed to catch puppet hanging
forever after executing the following (line numbers prepended for easy
commenting later):

<pre>
     1  10764 pipe([5, 7])                      = 0
     2  10764 clone(child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) 
= 10833
     3  10833 close(5 <unfinished ...>
     4  10764 close(7 <unfinished ...>
     5  10833 <... close resumed> )             = 0
     6  10764 <... close resumed> )             = 0
     7  10833 dup2(7, 1 <unfinished ...>
     8  10764 fcntl64(5, F_GETFL <unfinished ...>
     9  10833 <... dup2 resumed> )              = 1
    10  10764 <... fcntl64 resumed> )           = 0 (flags O_RDONLY)
    11  10833 close(7 <unfinished ...>
    12  10764 fstat64(5,  <unfinished ...>
    13  10833 <... close resumed> )             = 0
    14  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    15  10833 close(3 <unfinished ...>
    16  10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
    17  10833 <... close resumed> )             = 0
    18  10764 <... mmap2 resumed> )             = 0xb7843000
    19  10833 close(4 <unfinished ...>
    20  10764 _llseek(5, 0,  <unfinished ...>
    21  10833 <... close resumed> )             = 0
    22  10764 <... _llseek resumed> 0xbf7f5270, SEEK_CUR) = -1 ESPIPE (Illegal 
seek)
    23  10833 close(5 <unfinished ...>
    24  10764 fstat64(5,  <unfinished ...>
    25  10833 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    26  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    27  10833 close(6)                          = 0
    28  10833 close(7 <unfinished ...>
    29  10764 select(6, [5], [], [], {0, 16463} <unfinished ...>
    30  10833 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    31  10833 close(8)                          = -1 EBADF (Bad file descriptor)
    32  10833 close(9)                          = -1 EBADF (Bad file descriptor)
    33  10833 close(10)                         = -1 EBADF (Bad file descriptor)
    34  10833 close(11)                         = -1 EBADF (Bad file descriptor)
    35  [...]
    36  10833 close(253)                        = -1 EBADF (Bad file descriptor)
    37  10833 close(254)                        = -1 EBADF (Bad file descriptor)
    38  10833 close(255)                        = -1 EBADF (Bad file descriptor)
    39  10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars 
*/]) = 0
    40  10833 brk(0)                            = 0x8058000
    41  10833 uname({sys="Linux", node="beczulka", ...}) = 0
    42  [...]
    43  10833 write(1, "eth0      Link encap:Ethernet  H"..., 378 <unfinished 
...>
    44  10764 <... select resumed> )            = 1 (in [5], left {0, 40000})
    45  10833 <... write resumed> )             = 378
    46  10833 exit_group(0)                     = ?
    47  10764 --- SIGCHLD (Child exited) @ 0 (0) ---
    48  10764 select(6, [5], [], [], {0, 44385}) = 1 (in [5], left {0, 44385})
    49  10764 read(5, "eth0      Link encap:Ethernet  H"..., 1024) = 378
    50  10764 select(6, [5], [], [], {0, 44025}) = 1 (in [5], left {0, 44025})
    51  10764 read(5, "", 1024)                 = 0
    52  10764 close(5)                          = 0
    53  10764 munmap(0xb7843000, 4096)          = 0
    54  10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    55  10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    56  10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
    57  10764 waitpid(10833, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) 
= 10833
    58  10764 rt_sigaction(SIGHUP, {0xb7f438f0, [HUP], SA_RESTORER|SA_RESTART, 
0xb7d1c128}, {SIG_IGN}, 8) = 0
    59  10764 rt_sigaction(SIGQUIT, {0xb7f438f0, [QUIT], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, {SIG_IGN}, 8) = 0
    60  10764 rt_sigaction(SIGINT, {0xb7f438f0, [INT], SA_RESTORER|SA_RESTART, 
0xb7d1c128}, {SIG_IGN}, 8) = 0
    61  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
    62  10764 pipe([5, 7])                      = 0
    63  10764 clone(child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cf0b18) 
= 10834
    64  10834 close(5 <unfinished ...>
    65  10764 close(7 <unfinished ...>
    66  10834 <... close resumed> )             = 0
    67  10764 <... close resumed> )             = 0
    68  10834 dup2(7, 1 <unfinished ...>
    69  10764 fcntl64(5, F_GETFL <unfinished ...>
    70  10834 <... dup2 resumed> )              = 1
    71  10764 <... fcntl64 resumed> )           = 0 (flags O_RDONLY)
    72  10834 close(7 <unfinished ...>
    73  10764 fstat64(5,  <unfinished ...>
    74  10834 <... close resumed> )             = 0
    75  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    76  10834 close(3 <unfinished ...>
    77  10764 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
    78  10834 <... close resumed> )             = 0
    79  10764 <... mmap2 resumed> )             = 0xb7843000
    80  10834 close(4 <unfinished ...>
    81  10764 _llseek(5, 0,  <unfinished ...>
    82  10834 <... close resumed> )             = 0
    83  10764 <... _llseek resumed> 0xbf7f6510, SEEK_CUR) = -1 ESPIPE (Illegal 
seek)
    84  10834 close(5 <unfinished ...>
    85  10764 fstat64(5,  <unfinished ...>
    86  10834 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    87  10764 <... fstat64 resumed> {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    88  10834 close(6)                          = 0
    89  10834 close(7 <unfinished ...>
    90  10764 select(6, [5], [], [], {0, 32148} <unfinished ...>
    91  10834 <... close resumed> )             = -1 EBADF (Bad file descriptor)
    92  10834 close(8)                          = -1 EBADF (Bad file descriptor)
    93  10834 close(9)                          = -1 EBADF (Bad file descriptor)
    94  10834 close(10)                         = -1 EBADF (Bad file descriptor)
    95  [...]
    96  10834 close(253)                        = -1 EBADF (Bad file descriptor)
    97  10834 close(254)                        = -1 EBADF (Bad file descriptor)
    98  10834 close(255)                        = -1 EBADF (Bad file descriptor)
    99  10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
   100  10834 brk(0)                            = 0x8058000
   101  10834 uname({sys="Linux", node="beczulka", ...}) = 0
   102  10834 write(1, "lo        Link encap:Local Loopb"..., 781 <unfinished 
...>
   103  10764 <... select resumed> )            = 1 (in [5], left {0, 0})
   104  10834 <... write resumed> )             = 781
   105  10834 exit_group(0)                     = ?
   106  10764 --- SIGCHLD (Child exited) @ 0 (0) ---
   107  10764 select(6, [5], [], [], {0, 0})    = 1 (in [5], left {0, 0})
   108  10764 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 
10834
   109  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   110  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   111  10764 read(5, "lo        Link encap:Local Loopb"..., 1024) = 781
   112  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   113  10764 select(6, [5], [], [], {0, 59201}) = 1 (in [5], left {0, 59201})
   114  10764 read(5, "", 1024)                 = 0
   115  10764 close(5)                          = 0
   116  10764 munmap(0xb7843000, 4096)          = 0
   117  10764 rt_sigaction(SIGHUP, {SIG_IGN}, {0xb7f438f0, [HUP], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   118  10764 rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb7f438f0, [QUIT], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   119  10764 rt_sigaction(SIGINT, {SIG_IGN}, {0xb7f438f0, [INT], 
SA_RESTORER|SA_RESTART, 0xb7d1c128}, 8) = 0
   120  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   121  10764 select(0, [], [], [], {0, 57695}) = 0 (Timeout)
   122  10764 select(0, [], [], [], {0, 4040})  = 0 (Timeout)
   123  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   124  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   125  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   126  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   127  10764 select(0, [], [], [], {0, 59357}) = 0 (Timeout)
   128  10764 select(0, [], [], [], {0, 920})   = 0 (Timeout)
   129  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   130  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   131  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   132  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   133  10764 select(0, [], [], [], {0, 59303}) = 0 (Timeout)
   134  10764 select(0, [], [], [], {0, 901})   = 0 (Timeout)
   135  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   136  10764 waitpid(-1, 0xbf7ee868, WNOHANG)  = 0
   137  10764 waitpid(10834, 0xbf7f66a8, WNOHANG) = 0
   138  10764 select(0, [], [], [], {0, 0})     = 0 (Timeout)
   139  10764 select(0, [], [], [], {0, 59654}) = 0 (Timeout)
        [...] goes on forever

</pre>


My comments:
* lines 1-38 is basic pipe setup/fork/close all FDs
* 39-46: "ifconfig wlan0" does its thing
* 47-60: puppet reads ifconfig's output first, and then reaps it

So far, so good - all is well. But now notice what happens on the following
lines: the command this time is just "ifconfig", and all looks similar until
line 106. However then:
* line 107: there the timeout in select() is zero,
* line 108: the child is reaped straight away,
* line 111: only now the output is read,
* line 120: puppet makes an attempt to reap the same child again!

The circumstances are slightly different, because in the first case ECHILD is
returned and reported, but in the second case, no error is returned, and puppet
waits forever for a child with PID 10834 to exit, which will never happen.

However I think that the reason for a race might be the same - some generic
SIGCHLD handler that comes in and reaps the child before some other code can
call wait()?

FYI grepping through the log reveals that the following execve()s were called
before it hung:

<pre>
10766 execve("/usr/bin/which", ["which", "which"], [/* 21 vars */]) = 0
10768 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10769 execve("/bin/uname", ["uname", "-s"], [/* 18 vars */]) = 0
10771 execve("/usr/bin/which", ["which", "uname"], [/* 21 vars */]) = 0
10772 execve("/bin/uname", ["uname", "-r"], [/* 18 vars */]) = 0
10774 execve("/usr/bin/which", ["which", "hostname"], [/* 21 vars */]) = 0
10775 execve("/bin/hostname", ["hostname"], [/* 18 vars */]) = 0
10777 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10778 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10780 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10782 execve("/usr/bin/which", ["which", "sendmail"], [/* 21 vars */]) = 0
10784 execve("/usr/bin/which", ["which", "dnsdomainname"], [/* 21 vars */]) = 0
10785 execve("/bin/dnsdomainname", ["dnsdomainname"], [/* 18 vars */]) = 0
10787 execve("/usr/bin/which", ["which", "domainname"], [/* 21 vars */]) = 0
10789 execve("/usr/bin/which", ["which", "lsb_release"], [/* 21 vars */]) = 0
10791 execve("/usr/bin/lsb_release", ["lsb_release", "-a"], [/* 21 vars */]) = 0
10792 <... execve resumed> )            = 0
10793 execve("/usr/bin/apt-cache", ["apt-cache", "policy"], [/* 21 vars */]) = 0
10796 <... execve resumed> )            = 0
10797 execve("/usr/bin/dpkg-query", ["dpkg-query", "-f", "${Version} 
${Provides}\n", "-W", "lsb-core", "lsb-cxx", "lsb-graphics", "lsb-desktop", 
"lsb-qt4", "lsb-languages", "lsb-multimedia",
+"lsb-printing"], [/* 21 vars */]) = 0
10802 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10804 execve("/usr/sbin/dmidecode", ["/usr/sbin/dmidecode"], [/* 21 vars */]) = 0
10805 execve("/sbin/ifconfig", ["/sbin/ifconfig", "-a"], [/* 18 vars */]) = 0
10806 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wmaster0"], [/* 18 vars 
*/]) = 0
10807 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wmaster0"], [/* 18 vars */]) 
= 0
10808 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) 
= 0
10809 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10811 execve("/usr/bin/which", ["which", "uptime"], [/* 21 vars */]) = 0
10812 execve("/usr/bin/uptime", ["uptime"], [/* 18 vars */]) = 0
10813 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10816 execve("/usr/bin/which", ["which", "host"], [/* 21 vars */]) = 0
10817 execve("/usr/bin/host", ["host", "beczulka"], [/* 18 vars */]) = 0
10821 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
10823 execve("/usr/bin/which", ["which", "grep"], [/* 21 vars */]) = 0
10826 execve("/bin/grep", ["grep", "physical id", "/proc/cpuinfo"], [/* 21 vars 
*/] <unfinished ...>
10826 <... execve resumed> )            = 0
10827 execve("/usr/bin/cut", ["cut", "-d:", "-f", "2"], [/* 21 vars */] 
<unfinished ...>
10827 <... execve resumed> )            = 0
10828 execve("/usr/bin/sort", ["sort", "-u"], [/* 21 vars */] <unfinished ...>
10828 <... execve resumed> )            = 0
10829 execve("/usr/bin/wc", ["wc", "-l"], [/* 21 vars */] <unfinished ...>
10829 <... execve resumed> )            = 0
10830 execve("/sbin/ip", ["/sbin/ip", "link", "show", "wlan0"], [/* 18 vars 
*/]) = 0
10831 execve("/sbin/ifconfig", ["/sbin/ifconfig", "wlan0"], [/* 18 vars */]) = 0
10832 execve("/sbin/ip", ["/sbin/ip", "link", "show", "eth0"], [/* 18 vars */]) 
= 0
10833 execve("/sbin/ifconfig", ["/sbin/ifconfig", "eth0"], [/* 18 vars */]) = 0
10834 execve("/sbin/ifconfig", ["/sbin/ifconfig"], [/* 18 vars */]) = 0
</pre>

This is puppet 0.24.8-1gg4 (the differences between this and Debian 0.24.8-1
only seem to touch packaging).



----------------------------------------
Bug #2211: Facter timeouts reap all subprocesses thus confusing Puppet
http://projects.reductivelabs.com/issues/2211

Author: John Florian
Status: Accepted
Priority: High
Assigned to: Luke Kanies
Category: library
Target version: 1.6.0
Complexity: Unknown
Affected version: 0.24.8
Keywords: 


It is no longer possible to have puppet install packages via yum/rpm if the 
network interface is not bound to an IP address.  Our use case requires using 
puppet in the non-daemon mode and this is possible for us because the system 
will have all necessary manifests and other necessary files locally.  This 
worked just fine with 0.24.6 on Fedora 10, but began failing upon the upgrade 
to 0.24.8.

See the attachments for failure messages and a code diff that seems to have 
introduced the regression.  If I revert this one change, things work nicely 
once again.  Looks like a very simple fix if it weren't for the ominous looking 
comment in the code. :-)


-- 
You have received this notification because you have either subscribed to it, 
or are involved in it.
To change your notification preferences, please click here: 
http://reductivelabs.com/redmine/my/account

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Puppet Bugs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/puppet-bugs?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to