Okay, just got distracted by this happening again.  The way I'm reproducing it 
is by ^C'ing a Cassandane run (which cleanly shuts down the cyrus instances it 
had started) during the JMAP tests, and I guess I'm sometimes getting lucky on 
the timing and hitting this.

On Wed, Dec 4, 2019, at 2:22 PM, ellie timoney wrote:
> I think the really useful information to collect next time this happens 
> (and while the master process is still running) is:
> 
> * What does lsof tell us about that ready file descriptor (in the 
> example, fd 11)?  I would be very interested to know if it's a client 
> socket, or an internal messaging socket (that service processes use to 
> tell master their status).

master  3121 cyrus   11u  IPv4  49559608  0t0  TCP localhost:9146 (LISTEN)

This is the port the http service was listening on.

> * If you can attach a debugger and step through a couple of iterations 
> of master's big "for (;;) {" loop, what path is it taking?  What 
> decisions is it making?

So the problem pops up here:

            /* connections */
            if (y >= 0 && Services[i].ready_workers == 0 &&
                Services[i].nactive < Services[i].max_workers &&
                !service_is_fork_limited(&Services[i])) {
                if (verbose > 2)
                    syslog(LOG_DEBUG, "listening for connections for %s/%s",
                           Services[i].name, Services[i].familyname);
                FD_SET(y, &rfds);
                if (y > maxfd) maxfd = y;
            }

(gdb) p Services[i]
$28 = {name = 0x55df363956c0 "http", listen = 0x55df36395170 "127.0.0.1:9146", 
  proto = 0x55df36395680 "tcp", exec = 0x55df363956e0, babysit = 0, 
  associate = 0, family = 2, familyname = 0x55df35a8c780 "ipv4", socket = 11, 
  stat = {12, 13}, desired_workers = 0, max_workers = 2147483647, maxfds = 0, 
  maxforkrate = 0, ready_workers = 0, nforks = 1, nactive = 1, 
  nconnections = 1, forkrate = 0, nreadyfails = 0, lastreadyfail = 0, 
  last_interval_start = {tv_sec = 1578455582, tv_usec = 342312}, 
  interval_forks = 1}

The http service has shut down already... sort of. It's hanging around as a 
defunct process:
cyrus     3143  3121  0 14:53 ?        00:00:00 [httpd] <defunct>

We have no ready_workers, and nactive (1) is less than max_workers, so we're 
adding the service socket (fd 11) to the listen set.  If verbose were greater 
than 2, we'd be logging "listening for connections for http/ipv4" on every 
trip round this loop.
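
Spelling the check out with the numbers from the gdb dump above (this is just 
the condition with the observed values substituted in, not real code):

    /* y (the service socket) = 11            -> y >= 0 is true           */
    /* ready_workers = 0                      -> == 0 is true             */
    /* nactive = 1, max_workers = 2147483647  -> nactive < max_workers    */
    /* maxforkrate = 0 and forkrate = 0       -> presumably no fork-rate
                                                 limit kicks in either    */
    /* net result: FD_SET(11, &rfds) every single time round the loop     */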

So, there's something waiting on the socket for this service (a pending client 
connection, since fd 11 is the listening socket), and if we weren't already 
in_shutdown we'd be trying to spawn a new service process to handle it.  But we 
ARE in_shutdown, so we never spawn one, and whatever's waiting on the socket 
just sits there unhandled.

We don't simply finish shutting down, because that nactive=1 in the service 
table entry means we think we still have children, so we call child_janitor and 
then go round the loop again.

Sooooo we're waiting for child_janitor to clean up after the http service, I 
guess.

Inside child_janitor, here's the entry:
(gdb) print **p
$36 = {pid = 3143, service_state = SERVICE_STATE_BUSY, janitor_deadline = 0, 
  si = 1, 
  desc = 0x55df36397f30 "type:SERVICE name:http path:/dev/shm/cyrus/main/libexec/httpd", 
  spawntime = {tv_sec = 1578455582, tv_usec = 682977}, sighuptime = -1, 
  next = 0x0}

So master still thinks the http process is busy, and keeps spinning while it 
waits for it...  but the defunct state tells us the process has already exited, 
and the kernel is just keeping the zombie entry around until the parent 
(master) waitpid()s it and it can be removed completely.
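
For reference, here's the zombie/waitpid relationship in isolation -- a minimal 
standalone demo, nothing Cyrus-specific.  The child shows up in ps as <defunct> 
until the parent gets around to collecting it:

    /* zombie.c -- minimal demo of the <defunct> state.
     * Build with `cc -o zombie zombie.c` and watch ps from another terminal. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0)
            _exit(0);            /* child: exit immediately */

        /* parent: the child is now a zombie ("<defunct>" in ps) until we
         * waitpid() it.  Sleep so there's time to go and look. */
        sleep(30);

        int status;
        if (waitpid(pid, &status, 0) == pid)
            printf("reaped child %d\n", (int)pid);
        /* ...and now the zombie is gone from the process table. */
        return 0;
    }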

It's pretty clear that there's still client data on that socket, but I'm not 
sure if "shut down Cyrus while there's still a client sending data to httpd" is 
enough to trigger it, or if it's more subtle.  The data pending on the socket 
is the reason we see pselect() called over and over again, but I don't think 
it's the cause of the problem: without that pending data, we'd still be hanging 
here instead of shutting down cleanly, but we'd be blocked in pselect() instead.
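
That "readable listening socket that nobody ever accept()s" spin is easy to 
demonstrate outside of Cyrus, for what it's worth.  Toy sketch (my own code, 
not master's loop; error checking omitted):

    /* spin.c -- select() returns immediately, over and over, when a listening
     * socket has a pending connection that never gets accept()ed. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sin.sin_port = 0;                                /* any free port */

        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
        listen(lfd, 32);

        socklen_t len = sizeof(sin);
        getsockname(lfd, (struct sockaddr *)&sin, &len); /* learn the port */

        /* "client": connect to ourselves.  The handshake completes against
         * the backlog, so lfd becomes readable even though nobody accepts. */
        int cfd = socket(AF_INET, SOCK_STREAM, 0);
        connect(cfd, (struct sockaddr *)&sin, sizeof(sin));

        for (int i = 0; i < 5; i++) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(lfd, &rfds);
            /* NULL timeout = block until readable... but lfd already is,
             * so this returns immediately, every single time. */
            int n = select(lfd + 1, &rfds, NULL, NULL, NULL);
            printf("iteration %d: select returned %d\n", i, n);
        }
        close(cfd);
        close(lfd);
        return 0;
    }

The fd stays readable until somebody accept()s the connection (or the listener 
is closed), so every pass through the loop wakes up straight away.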

We _should_ have gotten a sigchld from httpd shutting down, and then reap_child 
should have waitpid'd it and cleaned up the child table, so we shouldn't be 
spinning like this.  So I guess the failure occurred somewhere in the sigchld 
handling, but by the time I notice it spinning it's too late to see what 
happened.  Did the sigchld go missing?  Did reap_child do the wrong thing?  Who 
knows!
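
I'm not claiming this is what went wrong, but as a reference point while I 
stare at the sigchld path: the usual non-racy shape is "keep SIGCHLD blocked, 
check a flag set by the handler, reap with waitpid(WNOHANG), and only unblock 
the signal atomically inside pselect()", so an exit that lands between the 
check and the blocking call can't be lost.  Toy standalone version (my sketch, 
not master.c):

    /* sigchld.c -- the flag-check / waitpid(WNOHANG) / pselect() dance. */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/select.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigchld;

    static void on_sigchld(int sig)
    {
        (void)sig;
        got_sigchld = 1;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);

        /* keep SIGCHLD blocked everywhere except inside pselect() */
        sigset_t block, orig;
        sigemptyset(&block);
        sigaddset(&block, SIGCHLD);
        sigprocmask(SIG_BLOCK, &block, &orig);

        pid_t pid = fork();
        if (pid == 0) _exit(0);            /* child exits straight away */

        for (;;) {
            if (got_sigchld) {
                got_sigchld = 0;
                pid_t p;
                int status;
                while ((p = waitpid(-1, &status, WNOHANG)) > 0)
                    printf("reaped pid %d\n", (int)p);
                if (p < 0 && errno == ECHILD)
                    return 0;              /* no children left: done */
            }
            /* pselect() swaps in 'orig' (SIGCHLD unblocked) atomically, so a
             * SIGCHLD that arrived while we were checking the flag above is
             * delivered here and we wake with EINTR -- it can't slip through
             * a window between the check and the blocking call. */
            if (pselect(0, NULL, NULL, NULL, NULL, &orig) < 0 && errno != EINTR) {
                perror("pselect");
                return 1;
            }
        }
    }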

I don't have any further insight at the moment.  I guess if I can find a way to 
reproduce it reasonably reliably, then I can litter reap_child with debug 
output and see what it reports just before it goes pear-shaped....

Cheers,

ellie
