[nwam-dev] seb-092 CR requested

Alan Maguire Mon, 28 Sep 2009 15:38:38 +0100

Michael Hunter wrote:
> On Fri, 25 Sep 2009 11:44:29 +0100
> Alan Maguire <Alan.Maguire at Sun.COM> wrote:
>
>   
>> Michael Hunter wrote:
>>     
>>> On Thu, 24 Sep 2009 08:24:03 -0700
>>> Michael Hunter <Michael.Hunter at Sun.COM> wrote:
>>>
>>> [...]
>>>   
>>>       
>>>> I think the real way to do this is to reference count anything that can
>>>> be held by multiple threads.  That'll also fan out through the code but
>>>> there are fewer error paths and we can prove it works fairly easily.
>>>>     
>>>>         
>>> New rough webrev at /net/coupe.eng/builds/mph/nwam1_work/webrev as we
>>> talked about this morning.
>>>
>>>   
>>>       
>> Wow, that was quick! So under this reference
>> counting scheme, on creation we set the refcount
>> to 2, since the list has a reference to the object along
>> with the reference derived from creating it in a locked
>> state. When we are finished with objects, we drop the
>> reference count by 1 since we have one less reference,
>> but when we destroy an object we drop the reference
>> count by 2 since we're not only intending to let our
>> reference to it go but also the reference derived from
>> the object list. If the count is 0 at this point, we call
>> nwamd_object_fini() which will actually remove the
>> object.
>>     
>
> yes
>
>   
>> I think I see a potential deadlock during object
>> destruction - if thread A is trying to destroy the
>> object, it holds the object lock. Meanwhile, thread
>> B is spinning on the object lock waiting
>> to increase the refcnt in nwamd_object_find().
>> It holds the object list lock, but not the object
>> lock. Meanwhile, thread A holds the object
>> lock but is attempting to acquire the object
>> list lock to walk the list and remove the object
>> (via nwamd_object_fini()).
>>     
>
> *blech* yes
>
>   
>> If this is an issue, I _think_ we could get around
>> it by manipulating the refcount before attempting
>> to acquire the lock in nwamd_object_find() using
>> atomic_add_XXX() (and using the atomic operation
>> for other increases/decreases). This would mean
>> that we're registered an interest in acquiring the object
>> so the refcount will not be 0 and we won't attempt
>> to fini() the object, just release it, and thread B
>> in the above example would then be responsible
>> for fini()'ing it. As I say, I may be missing something
>> here - let me know what you think. Thanks!
>>     
>
> See my comment in nwamd_object_decref() about maintaining the locking
> hierarchy between acquiring the list and locking the object.  That is
> the thing to review in the new webrev.
>
> /net/coupe.eng/builds/mph/nwam1_work/webrev
>
>   
A few questions


- Just so I follow: in nwamd_object_find() we spin
until we acquire the object lock and then increase the
refcount, all while holding the object list lock.
Another thread doing a release_and_destroy()
on the object will - if the refcount has fallen to
0 - first bump the refcount to 1, unlock the object
then lock the object list and the object. While holding
the object list and object locks we decrease the
refcount by 1 and if it reaches 0, call fini() on the
object. So because the reference
count is > 0 for the destroying thread, we are
guaranteed that another thread cannot destroy the
object in the interim, and because destruction is
predicated on having the object list lock, a thread that
was waiting in nwamd_object_find() on the object
lock will proceed first.

- do we need to ensure that multiple release_and_destroy()s
don't happen (leading to a refcount of < 0)? I can't
see any potential for this occuring in the daemon, but maybe
we could state this explicitly as an assumption?

- not something we're attempting to address here I know,
but do you have any thoughts on how we can increase
concurrency in this object model? One problem we have
at present is that if thread A holds the object lock for
a given object, and thread B wishes to acquire the same
object, it will hold the object list lock while it waits on the
object lock. As a consequence, no other threads can access
the object list (and hence any other objects) while this
is occurring. You can see this in practice with the doors
thread when nwamd is busy and we attempt an
"nwamadm list". I wonder if we can use the reference
counting approach to alleviate this? What if we took
an additional step and got rid of the object mutex
altogether? Locking an object simply reduces to
locking the object list, searching it, and when we
find the object we just bump its reference count
(using atomic_add_int()). Destroying it simly reduces
to locking the object list, and if the object reference
count is 0 after decreasing the refcount, we remove
the object. What do you think?

Thanks!

Alan

[nwam-dev] seb-092 CR requested

Reply via email to