Michael Hunter wrote: > On Fri, 25 Sep 2009 11:44:29 +0100 > Alan Maguire <Alan.Maguire at Sun.COM> wrote: > > >> Michael Hunter wrote: >> >>> On Thu, 24 Sep 2009 08:24:03 -0700 >>> Michael Hunter <Michael.Hunter at Sun.COM> wrote: >>> >>> [...] >>> >>> >>>> I think the real way to do this is to reference count anything that can >>>> be held by multiple threads. That'll also fan out through the code but >>>> there are fewer error paths and we can prove it works fairly easily. >>>> >>>> >>> New rough webrev at /net/coupe.eng/builds/mph/nwam1_work/webrev as we >>> talked about this morning. >>> >>> >>> >> Wow, that was quick! So under this reference >> counting scheme, on creation we set the refcount >> to 2, since the list has a reference to the object along >> with the reference derived from creating it in a locked >> state. When we are finished with objects, we drop the >> reference count by 1 since we have one less reference, >> but when we destroy an object we drop the reference >> count by 2 since we're not only intending to let our >> reference to it go but also the reference derived from >> the object list. If the count is 0 at this point, we call >> nwamd_object_fini() which will actually remove the >> object. >> > > yes > > >> I think I see a potential deadlock during object >> destruction - if thread A is trying to destroy the >> object, it holds the object lock. Meanwhile, thread >> B is spinning on the object lock waiting >> to increase the refcnt in nwamd_object_find(). >> It holds the object list lock, but not the object >> lock. Meanwhile, thread A holds the object >> lock but is attempting to acquire the object >> list lock to walk the list and remove the object >> (via nwamd_object_fini()). >> > > *blech* yes > > >> If this is an issue, I _think_ we could get around >> it by manipulating the refcount before attempting >> to acquire the lock in nwamd_object_find() using >> atomic_add_XXX() (and using the atomic operation >> for other increases/decreases). This would mean >> that we're registered an interest in acquiring the object >> so the refcount will not be 0 and we won't attempt >> to fini() the object, just release it, and thread B >> in the above example would then be responsible >> for fini()'ing it. As I say, I may be missing something >> here - let me know what you think. Thanks! >> > > See my comment in nwamd_object_decref() about maintaining the locking > hierarchy between acquiring the list and locking the object. That is > the thing to review in the new webrev. > > /net/coupe.eng/builds/mph/nwam1_work/webrev > > A few questions
- Just so I follow: in nwamd_object_find() we spin until we acquire the object lock and then increase the refcount, all while holding the object list lock. Another thread doing a release_and_destroy() on the object will - if the refcount has fallen to 0 - first bump the refcount to 1, unlock the object then lock the object list and the object. While holding the object list and object locks we decrease the refcount by 1 and if it reaches 0, call fini() on the object. So because the reference count is > 0 for the destroying thread, we are guaranteed that another thread cannot destroy the object in the interim, and because destruction is predicated on having the object list lock, a thread that was waiting in nwamd_object_find() on the object lock will proceed first. - do we need to ensure that multiple release_and_destroy()s don't happen (leading to a refcount of < 0)? I can't see any potential for this occuring in the daemon, but maybe we could state this explicitly as an assumption? - not something we're attempting to address here I know, but do you have any thoughts on how we can increase concurrency in this object model? One problem we have at present is that if thread A holds the object lock for a given object, and thread B wishes to acquire the same object, it will hold the object list lock while it waits on the object lock. As a consequence, no other threads can access the object list (and hence any other objects) while this is occurring. You can see this in practice with the doors thread when nwamd is busy and we attempt an "nwamadm list". I wonder if we can use the reference counting approach to alleviate this? What if we took an additional step and got rid of the object mutex altogether? Locking an object simply reduces to locking the object list, searching it, and when we find the object we just bump its reference count (using atomic_add_int()). Destroying it simly reduces to locking the object list, and if the object reference count is 0 after decreasing the refcount, we remove the object. What do you think? Thanks! Alan
