Date: Wed, 29 Sep 2021 09:09:06 -0700
From: "Pawel S. Veselov" <pawel.vese...@gmail.com>
Message-ID: <467fc0d3-37d1-a88a-2584-420f2a06b...@gmail.com>
    | So we caught where the queue is closed, and traced it back to
    | getaddrinfo(). That call both closes fd#3, creates a new kqueue
    | and leaves it open. This is the back trace from close:
    |
    | #0 0x0000732d69c07892 in close () from /usr/lib/libpthread.so.1

From this I'm guessing that freeradius is multi-threaded?

    | The full stack traces and ktraces can be found here:
    |
    | https://github.com/FreeRADIUS/freeradius-server/issues/4244

I saw some helpful data there, but hardly a full ktrace.

    | Our next step is to recompile libc with debugging symbols and start
    | poking around there to see why it is closing an fd that doesn't
    | belong to it, but if somebody knows why that might happen -
    | that'd be great.

Is it possible that something at startup is closing fds, but that this
is happening after the DNS resolver has been initialised? As you saw,
the libc address lookup routines leave the fd open. If some "make sure
all fds > 2 are closed at startup" type of functionality went and
closed it, that would cause a problem.

The kqueue fd is used to monitor /etc/resolv.conf for any changes that
would (or might) require it to be re-read, which is a useful thing to
do for long-running daemons. So it needs to remain open for the life
of the process.

There can also be issues with the resolver state if a multi-threaded
program isn't correctly linked with -lpthread and gets the
single-threaded resolver state instead of the malloc'd version.

The fd that is closed immediately before the kqueue() call isn't the
interesting one. It is just the one resolv.conf was read from. In your
traces resolv.conf is fopen()ed (that's fd 3) and read using stdio.
Then the file descriptor is dup'd (that's fd 5), and fd 3 is closed
(fclose()). That part is all very boring.
Then kqueue() is used to monitor fd 5 for any changes (the kqueue is
fd 3, the lowest available). If any changes occur, the resolver will
be re-init'd, though that is most likely not what is happening in
your case.

kre