Hi Conrad, first, thanks for your detailed analysis.
[ merging your two mails for an easier response ] On 05/15/2014 11:26 AM, Conrad Hoffmann wrote: > If haproxy has been running for a certain amount of time (this part > makes reproduction cumbersome), we quite often run into the following > issue when restarting haproxy: if some of the workers have active > connections that they must wait for to be terminated, a fraction of > those would start hogging CPU like crazy. I had already observed that case myself and thought it was gone because as you explain it's very hard to reproduce. > Attaching a debugger showed > that epoll_wait() would immediately return with events that they did not > process because they had already closed the file descriptor (and thus > owner == 0x0). Could you please confirm that it happens with the listening sockets ? I'll explain why below. > This led me to the epoll(7) man page [1], which in its > Q&A section states for Q6 that just closing a fd may not be sufficient > if the inode referred to by that fd is still open in another process, > which is the case when you fork() with open fds like haproxy does. Yes, the epoll API is a real mess. It's the only one I know that can wake a process reporting activity on a process that was closed when there's a record of activity in another process! And the problem is that when it happens, you cannot remove the FD from the victim process since it does not know it, the mapping is one way : file1->fd1, even if fd1 is closed or reused somewhere for file2 with a mapping like file2->fd1. > The scenario I constructed from this information (and here I am still > missing some knowledge) is that if a processes has a connection going on > when the signal hits, it can happen that it closes a file descriptor > that it doesn't have any connections on but then, while polling for > events on the remaining connection, receives events for the closed fd > because that one is still open in another process. Sorry for this being > kind of vague, but I am still trying to understand the relevant code paths. No it's different in fact. You really need to think about files in the kernel and FDs for the processes. I'm insisting on this because it's what was unclear to me at the beginning when I first met the joy of epoll. So what happens is that a process 1 creates a listening socket consisting of a file inside the kernel. Let's call it "file1". It is mapped to an FD inside the process, called "fd1". Then the process forks and creates 12 processes. Each of them has the same fd1 pointing to file1, and file1's refcount increases to 13. Then the parent process quits, and its fd1 closes, decreasing file1's refcount to 12. Each process polls its respective fd1 with its own epoll instance (we even had to implement a complex post-fork mechanism to prevent epoll FDs from being shared between processes). When the epoll_ctl(ADD) is made on fd1, what happens is that each process creates an event in the kernel associated to file1 saying "when you see a read activity here, please wake me up and mention 'fd1' and don't try to understand what this means". At this point, all processes are able to receive events for file1 (the listeneing socket). Usually, when you close an FD, you close the associated file and it's automatically removed from the epoll table. But in fact it's not exactly true. When you close an FD, you decrement the file's refcount by one, and the file is only closed once the refcount hits zero. This is normally transparent to processes, except here with epoll : it's mentionned in the man as you saw that the events are automatically removed when the *file* closes, not the *fd*. So when a process closes the listening socket during the reload, it removes the fd1->file1 mapping, decrements the file1's refcount by one but does not kill the even until this refcount is null. Now if there's new activity on that file, any process can be woken up, including those which already closed the FD. So they receive an event saying "hi, I'm fd1" as the return value for a call to epoll_wait() while they have closed fd1 and have no way to get back to file1 anymore since their only entry point to communicate with the kernel are FDs, not files. So once haproxy gets notified about activity on a closed FD, it cannot to anything to stop receiving events for that closed FD. Yes, I said this API was a real mess, and I maintain it! This cannot happen for all other FDs because they're used in a single process. But your description made me realize the exception of listening FDs that are shared between processes and which fall into this category. > However, based on the above assumption, I came up with the attached > patch as a remedy. Basically it just implements the poller.clo(fd) > handler for epoll, which is already implemented for other pollers. I > can't say with certainty, but while we had bug hit rate of at least 80% > before, I have not been able to trigger the bug in 5 attempts with the > patch applied, so I currently think it actually works. I'm absolutely sure that your patch fixes the issue because it removes the event before getting rid of the reference to the FD. However I have a problem with this approach, it will add an extra syscall to *every* connection, which is quite noticeable in a number of environments. Ideally we should extend the poller's API to have a "close-shared" function or something like this which would be used only for listening FDs that are shared between processes. Maybe we could simply have a flag in the fdtab indicating that the fd is shared (basically all FDs opened before the fork would have it), and decide in poller.clo() to only call epoll_ctl(DEL) when this flag is set, then clear it. It would seam reasonable in my opinion. Do you think that you could extend your current patch to do that ? > I would love to hear some feedback on this issue, maybe someone can give > insight on the actual code path that might be at hand here, or possibly > any side effects this patch might introduce that I have not thought of, etc. > > Thanks a lot and keep being awesome! > Conrad > > [1] http://linux.die.net/man/7/epoll On Fri, May 16, 2014 at 02:12:23PM +0200, Conrad Hoffmann wrote: > Hi, > > I just figured out some of the missing pieces. When sending USR1 to the > workers, they just close() their listening sockets instead of shutdown() > (like they do for TTOU), causing the file descriptor to stay open in the > parent process, which is why there are still epoll events coming in for it. > > Does this make sense? I didn't think about that, but it could indeed be a reason why you see it more often than other people. I also think that you may be running with longer connections than average, increasing the probability to observe it. > But I guess the patch remains valid. Or maybe one could use shutdown() > for the USR1 path, too? It will not change anything regarding this, the issue is really with the file->fd mapping in the event. Thanks ! Willy

