Iain,
many thanks for your work on this. This is a big patch so I will study
it carefully before applying it.
M
Iain Patterson wrote:
> I think I have a fix for the MAXFD issue. Or at least one cause of
> it. What appears to be happening is:
>
> cfexecd calls MailResult() to decide whether to send an email after
> fork()ing cfagent. Control reaches FileChecksum() at line 714 of
> cfexecd.c.
>
> -----BEGIN SNIPPET-----
> switch (type)
> {
> case 's': md = EVP_get_digestbyname("sha");
> break;
> case 'm': md = EVP_get_digestbyname("md5");
> break;
> default: FatalError("Software failure in ChecksumFile");
> }
>
> if (!md)
> {
> return 0;
> }
> -----END SNIPPET-----
>
> Under the debugger I saw EVP_get_digestbyname() return NULL and
> ChecksumFile() return 0. Quite why this is I don't know. The upshot is
> that the filehandle opened to perform the digest calculation is not closed.
>
> This results in file descriptor leakage. With lsof I could see
> cfexecd holding open these log files until eventually (after sixteen or
> so scheduled cfagent runs) the number of open filedescriptors exceeded
> MAXFD (20).
>
> On the next run LocalExec() calls cfpopen() to spawn cfagent.
> cfpopen() fork()s a child, which exec()s cfagent. The parent then
> reaches line 191 of popen.c.
>
> -----BEGIN SNIPPET-----
> if (fileno(pp) >= MAXFD)
> {
> snprintf(OUTPUT,CF_BUFSIZE,"File descriptor %d of child %d higher
> than MAXFD, check for defunct children", fileno(pp), pid);
> CfLog(cferror,OUTPUT,"");
> return NULL;
> }
> else
> {
> CHILD[fileno(pp)] = pid;
> return pp;
> }
> -----END SNIPPET-----
>
> cfopen() has returned a new file pointer to LocalExec(), which handles
> this gracefully, aborting the LocalExec() run. But it therefore never
> gets to cfpclose() and hence doesn't wait() for the cfagent child, which
> zombies.
>
> The attached patch fixes the problem by being more careful when MAXFD
> is exceeded and also by eliminating the particular cause of leakage that
> I saw.
>
> Firstly FileChecksum() closes the new filehandle before returning 0 in
> the case where EVP_get_digestbyname() fails.
>
> In popen.c I've moved the wait()/waitpid() code to a new function,
> cfpwait(pid_t) and called it not only from the original location in
> cfpclose() but also before returning NULL from cfpopen*() if the fork()
> has already completed.
>
> Even if MAXFD did still end up being exceeded, for example because
> there is another cause of leakage I didn't see, cfpclose() will now wait
> for "pid" -1, ie any child, and so not leave a zombie. It can't look up
> the specific pid from CHILD[fd] as fd is necessarily outside the array
> bounds.
>
> However, there may be more to this. A cursory grep of the source tree
> shows that the return value of cfpclose() is thrown away everywhere
> except for within package.c. While moving stuff into cfpwait() I
> noticed something a bit odd. The code looks like this:
>
> -----BEGIN SNIPPET-----
> #ifdef HAVE_WAITPID
>
> while (waitpid(pid, &status, 0) < 0
> ...
> return status;
>
> #else
>
> while ((wait_result = wait(&status)) != pid)
> ...
> if (WIFSIGNALED(status))
> ...
> if (! WIFEXITED(status))
> ...
> return (WEXITSTATUS(status));
> #endif
> -----END SNIPPET-----
>
> I suspect the #endif should come above the WIFSIGNALED() conditional.
> I haven't moved it, however, as the return code is checked in, eg
> FreeBSDPackageList(). Nonetheless it may be, given that people were
> seeing funky behaviour with RPM packages, that the return value of
> cfpopen() was causing trouble.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Bug-cfengine mailing list
> [email protected]
> https://cfengine.org/mailman/listinfo/bug-cfengine
--
Mark Burgess
Web: http://www.iu.hio.no/~mark
Tlf: +47 22453272
_______________________________________________
Bug-cfengine mailing list
[email protected]
https://cfengine.org/mailman/listinfo/bug-cfengine