Hi, I'm looking for someone to form an opinion about a problem I believe I have found, and a fix.
The problem manifests itself as an error when TO attempts to send an E-mail. The details: When traffic ops runs, it daemonizes itself, and as part of that, it does $SIG{CHLD} = 'IGNORE'; : https://github.com/apache/incubator-trafficcontrol/blob/master/traffic_ops/app/lib/MojoPlugins/Daemonize.pm#L28 But, when sending a mail, TO uses Mojolicous::Plugin::Mail, which in turn is a wrapper for the standard (And Obsolete) perl module MIME::Lite. MIME::Lite sends a mail by running the following piece of code (Snippet from app/local/lib/perl5/MIME/Lite.pm): my $pid = open SENDMAIL, "|-"; defined($pid) or die "open of pipe failed: $!\n"; if ( !$pid ) { ### child exec(@cmd) or die "can't exec $p{Sendmail}: $!\n"; ### NOTREACHED } else { ### parent $self->print( \*SENDMAIL ); close SENDMAIL || die "error closing $p{Sendmail}: $! (exit $?)\n"; $return = 1; } The problem with this code is that assumes that the SIGCHLD handler is left at it's DEFAULT handling, which means that when a child process terminates, the parent process can reap it during the 'close' statement. Traffic Ops runs with the SIGCHLD handler set to IGNORE, which means that when a child process dies, the parent process forgets about it immediately. So, if sendmail runs fast enough and finishes before the 'close' command is executed, the close will fail to find the child process, and the error propagates all the way to the calling code, looking something like this: Traffic Ops fatal error occurred while processing your request. Error at line 2732 ( close SENDMAIL || die "error closing $p{Sendmail}: $! (exit $?)\n";) error closing /usr/lib/sendmail: No child processes (exit -1) It should be noted that we are running TO on single-core machines for development, which means that the code is much more likely to fail in this manner (In machines with more cores, the parent process usually does not get suspended when sendmail is spawned, hence it reaches the 'close' command immediately and waits for sendmail to finish properly). I am not sure why TO needs to completely ignore SIGCHLD, and I am not sure if it is reasonable to ask from a module such as MIME::Lite to work when SIGCHLD is set to IGNORE. However, MIME::Lite is not being maintained any more (The owner recommends using better modules, see this <http://search.cpan.org/~rjbs/MIME-Lite-3.030/lib/MIME/Lite.pm#WAIT!>), and the latest Mojolicous::Plugin::Mail version still uses MIME::Lite. (I guess Mojolicous is not very actively maintained these days). Which means that until we switch to the experimental, Angular-based, Traffic Ops, we are stuck with the current MIME::Lite code. Maybe a good fix for this issue is to not set SIGCHLD to IGNORE while daemonizing. However, this means that every time we create a new process, we must waitpid() for it, otherwise we will accumulate zombies. I guess it is not very practical to ensure this across the entire code base of TO (And, as we have seen, all Perl modules that it uses). So the low-footprint fix I want to make is to wrap every call to $self->mail(...) done in our Email.pm <https://github.com/apache/incubator-trafficcontrol/blob/master/traffic_ops/app/lib/MojoPlugins/Email.pm> code, like this: $SIG{CHLD} = 'DEFAULT'; $self->mail(...); $SIG{CHLD} = 'IGNORE'; I tested this, and it fixes the problem. Thoughts ? I plan to create a JIRA and a PR, but I wanted to get someone's more expert view on this before I go on. Being a bit of a novice, maybe I am missing some key thing here. Thanks ,Oren. -- *Oren Shemesh* Qwilt | Work: +972-72-2221637| Mobile: +972-50-2281168 | or...@qwilt.com <y...@qwilt.com>