On Fri, Nov 07, 2008 at 02:24:01PM +0300, Kandalintsev Alexandre <[EMAIL PROTECTED]> wrote:
>> Well, you could even complain to the lkml again - I can give a precise
>> example and the required conditions for the issue.
> I remember one lkml thread where you posted a
> simple example, but I can't find it now.

In the meantime, I gained more insight into why this is a larger problem
- back then I thought this could be solved by the ev_*_fork handling, and
that this would simply be very annoying, but today I know that epoll is the
most braindamaged and unusable API of any kernel.

> Please provide an example.

Take this:

1. server creates socket (fd 7), starts to connect.
2. server forks, child gets a copy of the file descriptor
   (and other resources) but no copy of the epoll set.
3. parent gets timeout (from a timer), closes fd 7.
4. parent creates new socket (same fd 7), starts to connect.
5. original socket 7, still open in child, successfully connects,
   fd becomes writable.
6. at this point, the parent will receive writable events for that socket,
   even though it is no longer accessible from that process.

This will keep the process from sleeping (it will loop). If, as
recommended and used by e.g. libevent, the parent stores a pointer in the
epoll set, it will now also receive events with invalid pointers (probably
the reason libevent often crashes after a fork). libev didn't do that,
fortunately, but in previous versions it would busy-loop until the child
exited (or exec'd, when the fds are close-on-exec), which in many cases
is a short enough time that nobody notices it.
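
A minimal sketch of the scenario above, for anyone who wants to see the
spurious event for themselves. It is an illustration under assumptions, not
code from libev: a pipe stands in for the connecting socket (a write to the
pipe plays the role of "the connect succeeded"), and the helper name
spurious_events_after_fork is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns how many events the parent's epoll set
   reports for an fd the parent has already closed (-1 on setup failure).
   The fd number carried by the event is stored in *reported_fd. */
int spurious_events_after_fork(int *reported_fd)
{
    assert(reported_fd != NULL);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    /* step 1: register the read end, remembering its fd number */
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;

    /* step 2: the child inherits a copy of p[0] and simply keeps it open */
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        for (;;)
            pause();

    /* step 3: the parent closes "its" fd; the file stays open in the child */
    close(p[0]);

    /* step 5 stand-in: the underlying file becomes ready */
    int n = -1;
    if (write(p[1], "x", 1) == 1) {
        struct epoll_event out;
        /* step 6: the parent still gets an event for the closed fd */
        n = epoll_wait(epfd, &out, 1, 1000);
        if (n == 1)
            *reported_fd = out.data.fd;
    }

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    close(p[1]);
    close(epfd);
    return n;
}
```

On Linux I would expect the call to return 1, with *reported_fd holding the
number of the descriptor the parent closed before the event arrived.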

So why is this a problem? Why can't one just...

a) ignore the extra events? Because it keeps the process from blocking,
   and in the above case, it is very hard for a library to ignore the
   writable event on the socket: for connected sockets, a non-blocking
   read or write will just return EAGAIN, but how can I find out whether a
   socket is still connecting or not when all I know is that it "might" be
   writable?

b) remove the fd? Because the epoll API is incapable of that: even if I know
   that the event refers to a file descriptor I no longer have open, there is
   no way to remove it from the epoll set.

   The epoll API uses different tokens to identify the file in the API
   compared to what it stores internally (a major design bug). That means
   there is no way to remove this fd, as "the old fd 7" cannot even be
   expressed with the current epoll API (which has other, less problematic
   issues).

   Removing the fd could be done if there were a "remove by id/token"
   function - after all, adding an fd associates a token with that fd, but
   the API does not allow one to use this token to refer to the file: one
   needs an fd, even for files which no longer have an fd, only the token.

c) let the child help by closing the fds (e.g. after ev_default_fork)?
   Because there is no guarantee that the child even had time to execute yet -
   so any such method would be inherently racy.

d) let the child help by closing the epoll fd? Because of c), and also
   because closing the epoll fd has no effect either - the epoll set will
   still refer to the socket.

   Closing all other fds AND the epoll fd would work, but the child (think
   cgi) might actually want to use some of those for itself, and even if
   not, it would be a major effort, i.e. very slow. But it is inherently
   racy anyways.

e) let the parent handle it before forking? Because it is hard to detect
   forks in the first place and also because, in the presence of threads,
   this would require that all threads using epoll do that: a major
   synchronisation problem. (libev currently requires only that the child
   knows about event loops that it, itself, will want to reuse, which
   works without a problem with any other kernel event mechanism. This
   solution would require EVERY fork _caller_ to know about all event
   loops in the process.)

   So... yes, one could do that, but apart from the additional burden
   on the event library user, it would also be the most inefficient
   workaround: since the epoll API is so inefficient, recreating the epoll
   set takes a lot of time (more than one syscall per fd).
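
To make b) and the cost of recreating the set concrete, here is a hedged
sketch. It is my illustration, not libev code: the function names
del_stale_fd_errno and rebuild_epoll_set are invented, and a dup()ed
descriptor stands in for the copy that a forked child would hold.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd, close it while another descriptor keeps the file open,
   then try to delete the registration. Returns the errno of the failed
   EPOLL_CTL_DEL, 0 if the delete unexpectedly succeeded, -1 on setup
   failure. The dup()ed descriptor is deliberately left open. */
int del_stale_fd_errno(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return -1;

    int keeper = dup(fd);   /* assumption: plays the child's inherited copy */
    if (keeper < 0)
        return -1;
    close(fd);              /* "the old fd" no longer exists in this process */

    struct epoll_event dummy = { 0 };
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &dummy) < 0)
        return errno;       /* the API wants an fd, and there is none left */
    return 0;
}

/* The remaining recovery path: create a fresh set and re-register every
   fd that is still open, one epoll_ctl call per fd, then drop the old set.
   Returns the new epoll fd, or -1 on failure. */
int rebuild_epoll_set(int old_epfd, const int *fds, size_t nfds)
{
    assert(fds != NULL || nfds == 0);

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (size_t i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }

    close(old_epfd);
    return epfd;
}
```

The delete should fail with EBADF, since the closed descriptor can no longer
be named, while the old set keeps reporting events for it until the set is
replaced.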

The way libev handles this now is by associating a generation count
with each fd: the count is stored in the epoll set alongside the fd, and
libev compares it with the userspace one when it receives an event for
that fd. If the counts don't match, we know that this is a spurious event
and will close and recreate the epoll set (with the associated large
overhead, but in practice this can be avoided in many cases, so it is
probably cheaper than e), above, and is infinitely easier for users of
libev).

The only issue with it is that even a generation count of 32 bits (the
theoretical maximum if one also wants to store the fd number) might
eventually get exhausted.
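
A rough sketch of the generation-count idea, to show how the fd number and
a 32 bit counter can share epoll's 64 bit user data. This is a simplified
illustration and not libev's actual source: MAX_FDS, fd_generation,
pack_token, tag_event and event_is_current are all names made up here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

#define MAX_FDS 1024                     /* hypothetical fixed table size */

static uint32_t fd_generation[MAX_FDS];  /* userspace generation per fd */

/* fd goes in the low 32 bits, the generation in the high 32 bits */
uint64_t pack_token(int fd, uint32_t gen)
{
    return (uint64_t)(uint32_t)fd | ((uint64_t)gen << 32);
}

/* Call whenever fd is (re)registered: bumps the userspace generation and
   stores fd plus generation in the event that is handed to epoll_ctl. */
void tag_event(struct epoll_event *ev, int fd)
{
    assert(fd >= 0 && fd < MAX_FDS);
    ev->data.u64 = pack_token(fd, ++fd_generation[fd]);
}

/* Returns 1 if a received event belongs to the current registration of
   its fd, 0 if it is stale (the caller should then rebuild the set). */
int event_is_current(const struct epoll_event *ev)
{
    int fd = (int)(uint32_t)ev->data.u64;
    uint32_t gen = (uint32_t)(ev->data.u64 >> 32);

    return fd >= 0 && fd < MAX_FDS && fd_generation[fd] == gen;
}
```

An event tagged before fd 7 was closed and reused carries the old count, so
event_is_current rejects it, while an event from the new registration
passes. The counter is a plain uint32_t, so it wraps after 2^32
registrations of the same fd number, which is the exhaustion mentioned
above.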

No other API I know of is that braindamaged (requiring separate
add/del/mod operations is the other braindamaged API aspect, although not
handling that in an optimised fashion is just an efficiency problem, not a
correctness issue). While most APIs do require some work after a fork,
they will not generate spurious events, and you can always fix it in one
process without the queue becoming unusable in all processes.

Summary: the problem with epoll is that a fork instantly makes ALL epoll
sets used by the process unusable, with no way to recover from that, in
both parent and child, except by throwing away the epoll sets and creating
them from scratch, *iff* one is able to detect the problem.

If anything is unclear, feel free to ask for a more verbose explanation.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      [EMAIL PROTECTED]
      -=====/_/_//_/\_,_/ /_/\_\
