Monday, December 12, 2011, 1:51:54 PM, you wrote: > When I left Inktomi the inactivity timer was managed directly i.e. > INACTIVITY_TIMEOUT was set. It looks like someone checked in an > InactivityCop. This is protected by the NetHandler mutex. It also takes > out the NetVC mutex in case there is a timeout between the net_accept() and > the immediate event where the NetVC is transferred to the controlling Net > thread, but that can't happen as the inactivity_timeout_at is set to 0 in > close_UnixNetVConnection() so this is strictly unnecessary (although I > don't think harmful). We could remove this.
I tracked down several crashes to InactivityCop. Because the crashes frequently left the process state very mangled, I can't be sure of how involved it was, but I suspect it of de-allocating NetVCs while they were in use. > WTF: I have never in my life seen the code in > NEtVConnection::Handler::do_locked_io_close(). I have never even seen a > clas NetVConnection::Handler. This is extremely dangerous code. I have no > idea what it is trying to do, but I would strongly suggest removing it and > replacing it with netvc->do_io_close(). On the other had, after adding the NetVConnection::Handler change, my client observed almost an order of magnitude decrease in crashes related to this on a heavily loaded server. They are currently running a live site with this change. Conversely, the diagram here http://people.apache.org/~amc/images/TS-Lock-Classes.png is a post mortem of a crash on the 3.0.X codebase where the call was to netvc->do_io_close() (on the right, "1: do_io_close"). This is the 3.0.X codebase, before I made any changes. The crash happened because the VC was re-allocated while netvc->do_io_close was executing. At the point netvc->do_io_close is called in this situation, there is no way to guarantee that the pointer is still valid (because HttpServerSession stores a raw pointer to a netVC being handled by a different NetHandler). The mutex check you mention happens inside the method, by which time it is too late. > It calls "delete" for gosh sakes! This is a complete NONO. Before this > there were NO malloc's or free's on the critical path. This code is both > insanely inefficient as well as dangerous. I was not happy with it either, although as noted in the comments, live testing indicates that it happens only every few hours even with tens of thousands of connections / second. That field testing is the reason I was proceeding on this tack. The changes I proposed that started this string was to do something better. > do_io_close sets the "closed" flag which is then correctly managed by when > the NetHandler mutex is being held by the NetHandler. > The code in NetVConnection::Handle::do_locked_io_close is going to mangle > the data structures! Why? It handles the NetVC by calling netvc->do_io_close. > And yes, I do mean that it > should call do_io_close which DOES NOT delete the NetVC unless nh->>mutex->thread_holding == this_ethread() (in other words when it is > currently holding the NetHandler mutex). Before I made any changes (when the code was operating as you outline) it would crash inside UnixNetVConnection::do_io_close, AFAICT for this reason. But it's clear that just calling the do_io_close method is not safe. The problem (it seems to me) is that the NetVC can be deleted by the thread holding the NH lock while another thread is invoking methods on the NetVC (such as do_io_close). > Perhaps there is some confusion about what "close" means. Quite likely, as it's not really mentioned anywhere that I have found. > Do not cull entries from internal queues. Just call do_io_close() and the > owner of the memory and NetVC will reclaim it in good time. Smart pointers > of this type are a crutch. Some companies (including the one I now work > at) have an style guide which explicitly states that shared smart pointers > should not be used except in extraordinary circumstances when ownership > really is shared (e.g. Mutex). In this case ownership is extremely clear. There seems to be shared access between NetHandler, InactivityCop, and HttpServerSession. > The NetVC is owned by the NetHandler, period. How are the other two classes to safely access the NetVC from different threads? For a specific example, look at the diagram mentioned early and let me know how HttpServerSession should call netvc->do_io_close in a way that doesn't have the possibility that the virtual function pointer is garbage (due to the NetVC being on the free list) at the moment netvc->do_io_close is invoked. It can't look in the NetVC pointer to find the owning NetHandler, because that presumes the validity of the pointer.