Hi,

I'm looking for someone to form an opinion about a problem I believe I have
found, and a fix.

The problem manifests itself as an error when TO attempts to send an E-mail.

The details:
When traffic ops runs, it daemonizes itself, and as part of that, it does
$SIG{CHLD} = 'IGNORE'; :

https://github.com/apache/incubator-trafficcontrol/blob/master/traffic_ops/app/lib/MojoPlugins/Daemonize.pm#L28

But, when sending a mail, TO uses Mojolicous::Plugin::Mail, which in turn
is a wrapper for the standard (And Obsolete) perl module MIME::Lite.

MIME::Lite sends a mail by running the following piece of code (Snippet
from app/local/lib/perl5/MIME/Lite.pm):

       my $pid = open SENDMAIL, "|-";
        defined($pid) or die "open of pipe failed: $!\n";
        if ( !$pid ) {    ### child
            exec(@cmd) or die "can't exec $p{Sendmail}: $!\n";
            ### NOTREACHED
        } else {          ### parent
            $self->print( \*SENDMAIL );
            close SENDMAIL || die "error closing $p{Sendmail}: $! (exit
$?)\n";
            $return = 1;
        }

The problem with this code is that assumes that the SIGCHLD handler is left
at it's DEFAULT handling, which means that when a child process terminates,
the parent process can reap it during the 'close' statement.
Traffic Ops runs with the SIGCHLD handler set to IGNORE, which means that
when a child process dies, the parent process forgets about it immediately.
So, if sendmail runs fast enough and finishes before the 'close' command is
executed, the close will fail to find the child process, and the error
propagates all the way to the calling code, looking something like this:

Traffic Ops fatal error occurred while processing your request.
Error at line 2732 ( close SENDMAIL || die "error closing $p{Sendmail}: $!
(exit $?)\n";)
error closing /usr/lib/sendmail: No child processes (exit -1)

It should be noted that we are running TO on single-core machines for
development, which means that the code is much more likely to fail in this
manner (In machines with more cores, the parent process usually does not
get suspended when sendmail is spawned, hence it reaches the 'close'
command immediately and waits for sendmail to finish properly).

I am not sure why TO needs to completely ignore SIGCHLD, and I am not sure
if it is reasonable to ask from a module such as MIME::Lite to work when
SIGCHLD is set to IGNORE.
However, MIME::Lite is not being maintained any more (The owner recommends
using better modules, see this
<http://search.cpan.org/~rjbs/MIME-Lite-3.030/lib/MIME/Lite.pm#WAIT!>), and
the latest Mojolicous::Plugin::Mail version still uses MIME::Lite.
(I guess Mojolicous is not very actively maintained these days).
Which means that until we switch to the experimental, Angular-based,
Traffic Ops, we are stuck with the current MIME::Lite code.

Maybe a good fix for this issue is to not set SIGCHLD to IGNORE while
daemonizing. However, this means that every time we create a new process,
we must waitpid() for it, otherwise we will accumulate zombies.
I guess it is not very practical to ensure this across the entire code base
of TO (And, as we have seen, all Perl modules that it uses).

So the low-footprint fix I want to make is to wrap every call to
$self->mail(...) done in our Email.pm
<https://github.com/apache/incubator-trafficcontrol/blob/master/traffic_ops/app/lib/MojoPlugins/Email.pm>
code, like this:

$SIG{CHLD} = 'DEFAULT';
$self->mail(...);
$SIG{CHLD} = 'IGNORE';

I tested this, and it fixes the problem.

Thoughts ?

I plan to create a JIRA and a PR, but I wanted to get someone's more expert
view on this before I go on.
Being a bit of a novice, maybe I am missing some key thing here.

Thanks ,Oren.
--

*Oren Shemesh*
Qwilt | Work: +972-72-2221637| Mobile: +972-50-2281168 | or...@qwilt.com
<y...@qwilt.com>

Reply via email to