Re: TS-857 and finer grained locking

taorui Wed, 14 Mar 2012 08:57:56 -0700

On Wed, 2012-03-14 at 10:22 -0500, Alan M. Carroll wrote:
> Tuesday, March 13, 2012, 11:46:15 PM, you wrote:
> 
> > Here are my comments for what they are worth.
> 
> > First, let me detail the issue this is trying to address.
> 
> > The way that most clients work with VCs is via and Processor::open_XXXX
> > which calls back with and OPEN event at which point they set VIOs mutex
> > field and from this point on, access to the VC is under that lock including
> > the do_io_close() which eventually results in the VC being decallocated.
> 
> > There is, however, one situation where this simple and safe order of events
> > is not followed.  That is connection sharing to origin servers.  Here the
> > situations starts the same, but when the client is done with the connection
> > it does not issue a do_io_close(), and this is where the problems can begin.
> 
> That's not my interpretation of the crashes. We tried various settings for 
> connection sharing to no observed effect on crashing type or frequency. In 
> fact all of the configurations I use for testing have connection sharing 
> disabled.


according to your example timeline, can we tell that there is a path
that httpSM dereference vc after it call vc->do_io_close? 

> > ming_zym wrote:
> > give us more detail on the crashing back traces as much as possible.
> 
> As far as I can tell the problem arises when the VCs in a HttpServerSession 
> are split across two threads. In this case, even with proper locking, the VC 
> in the HttpServerSession that is owned by the "other" thread can be closed 
> (via close_UnixNetVConnection) by that other thread. When the 
> HttpServerSession thread then attempts to access the VC, it crashes. In fact 
> many of the crashes occurred during mutex acquisition because the mutex had 
> been cleared (all of the invalid access of memory location 0x18 are of this 
> nature). I instrumented the VCs to verify this and consistently had the 
> result that the crash was due to one thread calling close_UnixNetVConnection 
> while a pointer to the VC was still "live" on another thread. Every crash I 
> investigated was consistent with this scenario. Because only a raw pointer 
> was used, there was no way to prevent or detect this. The kludge that is 
> currently in place serves essentially to detect the situation and cleanup 
> afterwards. The noticeable improvement after implementation is additional 
> evidence for this theory. IMHO this is the single root cause of all the 
> do_io_close related crashes.
> 
> If this is the case then all of the operations involved were done under lock. 
> I have attached an illustration of a stack trace from such a crash where you 
> can see the two threads and that the HttpServerSession has VCs from both. 
> Even if full locking is done, it seems to me that if the 0x2b702ac4b010 
> thread has called already called close_UnixNetVConnection on the 0x6a6fc7 VC, 
> then the 0x2b702ab4a010 thread will crash on its call to do_io_close. It 
> can't even acquire the mutex because that's been cleared.
> 
> Here's an example timeline, starting with the VC not locked or actively in 
> use.
> 
> Thread A (..c4b010)                          Thread B (..b4a010)
> (VC Owner)                                   (HttpServerSession owner) 
> VC->mutex->acquire()
> VC->do_io_close
>  |- close_UnixNetVConnection
>     |- mutex->clear()
>                                              mutex->acquire(): crash
> 
> No lock is going to protect you if the lock itself can go stale.
> 
> One might ask, why is HttpServerSession split across threads like that? I 
> have no idea. But it seems to happen much more with forward proxy (note: I 
> have only indirect evidence for that).
> 
> 
> > John Plevyak writes:
> > So, this patch.  What does it do?  It uses smart points to prevent either
> > of the two threads from making one particular change to the shared NetVC
> > that they are currently scribbling all over: that of deleting it while the
> > other is still running.  It doesn't prevent any of the other horrors, or
> > all other manner of crashes, race conditions and unexpected behavior, just
> > the one, deallocating.  It is a serious one, but not the only one.
> 
> My current view is that this is the only problem, because in all other cases 
> the locking is working.
> 
> > 3) we can do something more heavy handed and change the way locking is
> > handled in the NetProcessor.  In particular, wrap a lock around the code
> > which allocates, frees and *changes the mutex* of a NetVC such that it is
> > not possible for two threads to both be running while scribbling on the
> > same NetVC.   Once we do this we will not have to us smart pointers for
> > NetVCs because they will be no different from anything else that you call
> > free() on and after which you can't use the pointer.
>  
> Could you describe how this would work for my example timeline? In that case 
> thread A calls free() and thread B uses the pointer later, thread A owning 
> the VC the entire time.
> 
> Or, alternatively (for anyone) if you could describe how to prevent 
> HttpServerSession from referencing VCs in different threads. For all I know 
> there's some simple change that would fix that.
>

Re: TS-857 and finer grained locking

Reply via email to