https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6996
Bug ID: 6996
Summary: spamd --round-robin conflicts with multiple listen sockets
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: All
OS: All
Status: NEW
Severity: major
Priority: P2
Component: spamc/spamd
Assignee: [email protected]
Reporter: [email protected]
Spamd has two modes of managing its child processes:
- when option --round-robin is specified, all free child processes
call accept() and simultaneously wait for a new task; the kernel
decides which child process gets the job (its accept() call returns).
(So --round-robin is not exactly a 'round robin', but rather
a 'catch as catch can'.)
- when --round-robin is *not* specified, task scheduling is the job
of the master process, using the SpamdForkScaling module. The master
waits in a select() for a new request, then tells one of the
child processes (through a backchannel pipe) to go ahead,
call accept() and take the request.
When multiple listen sockets are specified (like inet, unix
and ssl sockets with SpamAssassin 3.3, or now with 3.4 even multiple
inet and/or inet6 sockets), accept() would need to wait on
multiple sockets at once. As this is not possible, the
solution is to use a select() on all sockets, and when one is ready,
do an accept() on it.
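
The select()-then-accept() pattern described above can be sketched
as follows. This is a minimal illustration in Python, not spamd's
actual Perl code; the helper names make_listener and accept_from_any
are hypothetical.

```python
import select
import socket


def make_listener(host="127.0.0.1"):
    """Create a TCP listening socket on an ephemeral port (illustrative helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))
    sock.listen(16)
    return sock


def accept_from_any(listeners, timeout=None):
    """Wait until one of the listeners is readable, then accept() on it.

    timeout=None blocks indefinitely; None is returned if the timeout expires.
    """
    ready, _, _ = select.select(listeners, [], [], timeout)
    if not ready:
        return None
    listener = ready[0]
    conn, _addr = listener.accept()
    return listener, conn
```

With a single process doing both steps, the accept() cannot block,
because nobody else can take the connection between the two calls.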
This is just fine in case of SpamdForkScaling scheduling, as the
master process is the one to do a select(), and activates only one
child process to go for an accept().
The trouble strikes when spamd has to listen on multiple sockets
*and* --round-robin is specified, so that child processes are
autonomous in deciding (with the help of the kernel) who gets the job.
There are two bugs there: one is easy to fix, the other is
more troublesome.
The easy-to-fix case is that the select() in a child process should
*not* use a timeout (0.1 seconds currently) when --round-robin
is specified: requests are not scheduled by a master process,
so the client connect request may not yet be there when a child
process comes to its select() call. This was the problem reported
by Jan Hejl on the dev mailing list on 2014-01-28.
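
A rough sketch of the easy fix, in Python rather than spamd's Perl:
the timeout is chosen according to the scheduling mode. The names
child_select_timeout and wait_for_ready are hypothetical; only the
0.1 second value comes from the description above.

```python
import select
import socket

SCHEDULED_TIMEOUT = 0.1  # seconds; the timeout the report says is used today


def child_select_timeout(round_robin):
    """Return the select() timeout a child should use.

    With --round-robin nobody has announced a pending request, so the child
    must block until one arrives (None means "no timeout" for select.select).
    """
    return None if round_robin else SCHEDULED_TIMEOUT


def wait_for_ready(listeners, round_robin):
    """Return the list of listening sockets that have a pending connection."""
    ready, _, _ = select.select(
        listeners, [], [], child_select_timeout(round_robin)
    )
    return ready
```

With the short timeout, an idle autonomous child gets an empty list
back almost immediately; without a timeout it simply sleeps in
select() until a client connects.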
The other bug may not be so obvious, and has been lurking in code
ever since spamd got the ability to listen on multiple sockets
(certainly in 3.3, probably in 2.x or even earlier). This bug is
getting more exposure now that spamd can listen on any number of
sockets.
As the code stands now, all autonomous idle child processes call
a select() in case of multiple sockets. When a request comes in,
*all* of them see a socket ready and move on to accept() on it.
Thanks to the kernel only one of them completes the accept(); the rest
hang in their accept() call *for that socket only*, and are
unavailable to service requests on other sockets. Effectively this
means that when this happens, only one child process (the one that
took the request and is busy now) will be available to take the
next request on *any* other socket; all the rest are only available
on the socket of the previous request.
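
The race can be reproduced in a single process, with two select()
calls standing in for two idle children. This is only an
illustrative Python sketch (the function name stale_readiness_demo
is made up); the listening socket is switched to non-blocking mode
so that the second accept() reports that it would block instead of
actually hanging.

```python
import select
import socket


def stale_readiness_demo():
    """Show that readiness seen by two 'children' can be consumed by only one."""
    listeners = []
    for _ in range(2):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 0))
        sock.listen(4)
        listeners.append(sock)
    sock_a, sock_b = listeners

    client = socket.create_connection(sock_a.getsockname())

    # Both "children" poll the same sockets and both see sock_a as ready.
    seen_by_child1, _, _ = select.select(listeners, [], [], 5)
    seen_by_child2, _, _ = select.select(listeners, [], [], 5)

    # Child 1 wins the accept() and takes the only pending connection.
    conn, _addr = seen_by_child1[0].accept()

    # Child 2 now calls accept() on sock_a as well. A blocking socket would
    # hang here, deaf to sock_b; non-blocking mode lets us observe that.
    sock_a.setblocking(False)
    try:
        extra, _addr = sock_a.accept()
        extra.close()
        second_would_block = False
    except BlockingIOError:
        second_would_block = True

    for s in (conn, client, sock_a, sock_b):
        s.close()
    return seen_by_child1 == [sock_a], seen_by_child2 == [sock_a], second_would_block
```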
A workaround is trivial: don't use --round-robin when listening
on multiple sockets --- or the other way around, don't listen on
multiple sockets when --round-robin is specified.
The solution is to introduce an explicit lock *before* the call
to select() in a child process (and not depend on an implicit
lock by accept()). This way the lock will decide which of
the autonomous child processes will take the next request.
This arbitration will take place when a child process becomes
idle (not when the connect request comes in, as it is now).
Net::Server uses the same locking principle, so there's nothing
new about it. It seems the only question is what kind of locking to
use: apparently Windows does not / did not offer an flock mechanism,
and pipe locking can be used there.
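
A minimal sketch of the lock-before-select idea, assuming a POSIX
system with flock() and written in Python rather than Perl. The
function name serve_one and the lock_path argument are hypothetical;
this is not the actual patch, nor Net::Server's implementation.

```python
import fcntl
import select
import socket


def serve_one(listeners, lock_path):
    """Take an exclusive lock, then select() and accept() while holding it.

    Only the lock holder polls the sockets, so the readiness it sees cannot
    be consumed by a sibling before its accept() call.
    """
    # Every caller opens its own descriptor, so flock() excludes the others.
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # arbitration happens here
        try:
            ready, _, _ = select.select(listeners, [], [])  # no timeout
            listener = ready[0]
            conn, _addr = listener.accept()
        finally:
            # Release before handling the request so the next idle child
            # can start waiting on all sockets.
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return listener, conn
```

Each child would call serve_one() in a loop, process the returned
connection outside the lock, and then queue for the lock again.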