On Wed, 2012-03-14 at 10:22 -0500, Alan M. Carroll wrote: > Tuesday, March 13, 2012, 11:46:15 PM, you wrote: > > > Here are my comments for what they are worth. > > > First, let me detail the issue this is trying to address. > > > The way that most clients work with VCs is via and Processor::open_XXXX > > which calls back with and OPEN event at which point they set VIOs mutex > > field and from this point on, access to the VC is under that lock including > > the do_io_close() which eventually results in the VC being decallocated. > > > There is, however, one situation where this simple and safe order of events > > is not followed. That is connection sharing to origin servers. Here the > > situations starts the same, but when the client is done with the connection > > it does not issue a do_io_close(), and this is where the problems can begin. > > That's not my interpretation of the crashes. We tried various settings for > connection sharing to no observed effect on crashing type or frequency. In > fact all of the configurations I use for testing have connection sharing > disabled.
according to your example timeline, can we tell that there is a path that httpSM dereference vc after it call vc->do_io_close? > > ming_zym wrote: > > give us more detail on the crashing back traces as much as possible. > > As far as I can tell the problem arises when the VCs in a HttpServerSession > are split across two threads. In this case, even with proper locking, the VC > in the HttpServerSession that is owned by the "other" thread can be closed > (via close_UnixNetVConnection) by that other thread. When the > HttpServerSession thread then attempts to access the VC, it crashes. In fact > many of the crashes occurred during mutex acquisition because the mutex had > been cleared (all of the invalid access of memory location 0x18 are of this > nature). I instrumented the VCs to verify this and consistently had the > result that the crash was due to one thread calling close_UnixNetVConnection > while a pointer to the VC was still "live" on another thread. Every crash I > investigated was consistent with this scenario. Because only a raw pointer > was used, there was no way to prevent or detect this. The kludge that is > currently in place serves essentially to detect the situation and cleanup > afterwards. The noticeable improvement after implementation is additional > evidence for this theory. IMHO this is the single root cause of all the > do_io_close related crashes. > > If this is the case then all of the operations involved were done under lock. > I have attached an illustration of a stack trace from such a crash where you > can see the two threads and that the HttpServerSession has VCs from both. > Even if full locking is done, it seems to me that if the 0x2b702ac4b010 > thread has called already called close_UnixNetVConnection on the 0x6a6fc7 VC, > then the 0x2b702ab4a010 thread will crash on its call to do_io_close. It > can't even acquire the mutex because that's been cleared. > > Here's an example timeline, starting with the VC not locked or actively in > use. > > Thread A (..c4b010) Thread B (..b4a010) > (VC Owner) (HttpServerSession owner) > VC->mutex->acquire() > VC->do_io_close > |- close_UnixNetVConnection > |- mutex->clear() > mutex->acquire(): crash > > No lock is going to protect you if the lock itself can go stale. > > One might ask, why is HttpServerSession split across threads like that? I > have no idea. But it seems to happen much more with forward proxy (note: I > have only indirect evidence for that). > > > > John Plevyak writes: > > So, this patch. What does it do? It uses smart points to prevent either > > of the two threads from making one particular change to the shared NetVC > > that they are currently scribbling all over: that of deleting it while the > > other is still running. It doesn't prevent any of the other horrors, or > > all other manner of crashes, race conditions and unexpected behavior, just > > the one, deallocating. It is a serious one, but not the only one. > > My current view is that this is the only problem, because in all other cases > the locking is working. > > > 3) we can do something more heavy handed and change the way locking is > > handled in the NetProcessor. In particular, wrap a lock around the code > > which allocates, frees and *changes the mutex* of a NetVC such that it is > > not possible for two threads to both be running while scribbling on the > > same NetVC. Once we do this we will not have to us smart pointers for > > NetVCs because they will be no different from anything else that you call > > free() on and after which you can't use the pointer. > > Could you describe how this would work for my example timeline? In that case > thread A calls free() and thread B uses the pointer later, thread A owning > the VC the entire time. > > Or, alternatively (for anyone) if you could describe how to prevent > HttpServerSession from referencing VCs in different threads. For all I know > there's some simple change that would fix that. >