On Fri, Nov 07, 2008 at 02:24:01PM +0300, Kandalintsev Alexandre <[EMAIL PROTECTED]> wrote:
>> Well, you could even complain the lkml again - I can give a precise
>> example and the required conditions for the issue.
>
> I remember one lkml-thread where you posted
> simple example, but I can't find it now.

In the meantime, I gained more insight into why this is a larger problem -
back then I thought this could be solved by the ev_*_fork handling, and that
this would simply be very annoying, but today I know that epoll is the most
braindamaged and unusable API of any kernel.

> Please provide example.

Take this:

1. server creates socket (fd 7), starts to connect.
2. server forks, child gets a copy of the file descriptor (and other
   resources) but no copy of the epoll set.
3. parent gets timeout (from a timer), closes fd 7.
4. parent creates new socket (same fd 7), starts to connect.
5. original socket 7, still open in child, successfully connects, fd
   becomes writable.
6. at this point, the parent will receive writable events for that socket,
   even though it is no longer accessible from that process.

This will keep the process from sleeping (it will loop). If, as recommended
and used by e.g. libevent, the parent stores a pointer in the epoll set, it
will now also receive events with invalid pointers (probably the reason
libevent often crashes after a fork).

libev didn't do that, fortunately, but in previous versions it would
busy-loop till the child exited (or exec'd, when the fds are close-on-exec),
which in many cases is short enough a time so nobody notices it.

So why is this a problem? Why can't one just...

a) ignore the extra events?

Because it keeps the process from blocking, and in the above case, it is
very hard for a library to ignore the writable event on the socket: for
connected sockets, a non-blocking read or write will just return EAGAIN, but
how can I find out whether a socket is still connecting or not when all I
know is that it "might" be writable?

b) remove the fd?

Because the epoll API is incapable of that: even if I know that the event
refers to a file descriptor I no longer have open, there is no way to remove
it from the epoll set.
The epoll API uses different tokens to identify the file in the API compared
to what it stores internally (a major design bug). That means there is no
way to remove this fd, as "the old fd 7" cannot even be expressed with the
current epoll API (which has other, less problematic issues).

Removing the fd could be done if there were a "remove by id/token" function -
after all, adding an fd associates a token with that fd, but the API does not
allow one to use this token to refer to the file: one needs an fd, even for
files which no longer have an fd, only the token.

c) let the child help by closing the fds (e.g. after ev_default_fork)?

Because there is no guarantee that the child even had time to execute yet -
so any such method would be inherently racy.

d) let the child help by closing the epoll fd?

Because of c), and also because closing the epoll fd has no effect either -
it will still refer to the socket. Closing all other fds AND the epoll fd
would work, but the child (think cgi) might actually want to use some of
those for itself, and even if not, it would be a major effort, i.e. very
slow. But it is inherently racy anyway.

e) let the parent handle it before forking?

Because it is hard to detect forks in the first place, and also because, in
the presence of threads, this would require that all threads using epoll
would do that: a major synchronisation problem (libev currently requires
only that the child knows about event loops that it, itself, will want to
reuse, which works without a problem with any other kernel event mechanism.
This solution would require EVERY fork _caller_ to know about all event
loops in the process).

So... yes, one could do that, but apart from the additional burden on the
event library user, it would also be the most inefficient workaround: since
the epoll API is so inefficient, recreating the epoll set takes a lot of
time (more than one syscall per fd).
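
The scenario in steps 1-6 and point b) above can be reproduced with a short
script. This is only a sketch under stated assumptions, not libev code: it
uses Python's standard select.epoll wrapper instead of the C API, it
replaces the connecting socket with one end of a socketpair (which is
writable immediately), and every name in it (demo_stale_epoll_event, gate_r,
gate_w, fresh) is made up for illustration. Linux only.

```python
import errno
import os
import select
import socket


def demo_stale_epoll_event():
    """Return (old_fd, fresh_fd, events_seen, removal_errno)."""
    # Step 1: stand-in for the connecting socket. One end of a socketpair
    # is writable right away, which replaces the "connect succeeds" step.
    sock, peer = socket.socketpair()
    old_fd = sock.fileno()
    ep = select.epoll()
    ep.register(old_fd, select.EPOLLOUT)

    # Step 2: fork. The child only keeps its inherited copy of the socket
    # open; the pipe ("gate") holds it alive until the parent is done.
    gate_r, gate_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(gate_w)
            os.read(gate_r, 1)  # returns at EOF, when the parent closes gate_w
        finally:
            os._exit(0)
    os.close(gate_r)

    try:
        # Step 3: the parent closes its copy of the socket.
        sock.close()
        # Step 4: a new socket normally gets the lowest free fd number,
        # which is the number that was just closed.
        fresh = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        fresh_fd = fresh.fileno()

        # Step 6: the parent is still woken up for the old registration,
        # every time it polls (level-triggered), so it cannot sleep.
        events_seen = [ep.poll(1.0), ep.poll(1.0)]

        # Point b): removal takes an fd number, but that number no longer
        # names the registered file in this process.
        try:
            ep.unregister(old_fd)
            removal_errno = None
        except OSError as exc:
            removal_errno = exc.errno
        fresh.close()
    finally:
        os.close(gate_w)  # lets the child exit
        os.waitpid(pid, 0)
        peer.close()
        ep.close()
    return old_fd, fresh_fd, events_seen, removal_errno


if __name__ == "__main__":
    print(demo_stale_epoll_event())
```

On a Linux system this should report the old fd number on both polls, and
the removal attempt should fail with ENOENT when the fd number was reused
for the new socket (or EBADF if it was not), which is the "old fd 7 cannot
be expressed" problem from point b).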

The way libev handles this now is by associating a generation count with
each fd, which it compares with the userspace one when it receives an event
for an fd. If the count doesn't match, we know that this is a spurious event
and will close and recreate the epoll set (with the associated large
overhead, but in practice this can be avoided in many cases, so it is
probably cheaper than e), above, and is infinitely easier for users of
libev).

The only issue with it is that even a generation count of 32 bits (the
theoretical maximum if one also wants to store the fd number) might
eventually get exhausted.

No other API I know of is that braindamaged (requiring separate add/del/mod
operations is the other braindamaged API aspect, although not handling that
in an optimised fashion is just an efficiency problem, not a correctness
issue). While most APIs do require some work after a fork, they will not
generate spurious events, and you can always fix it in one process without
the queue becoming unusable in all processes.

Summary: the problem with epoll is that a fork instantly makes ALL epoll
sets used by the process unusable, with no way to recover from that, in both
parent and child, except by throwing away the epoll sets and creating them
from scratch, *iff* one is able to detect the problem.

If anything is unclear, feel free to ask for a more verbose explanation.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      [EMAIL PROTECTED]
      -=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
libev mailing list
libev@lists.schmorp.de
http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev