Re: [ofa-general] Re: [PATCH 2.6.24] rdma/cm: fix deadlock destroying listen requests

Kanoj Sarcar Wed, 10 Oct 2007 14:42:36 -0700

Sean Hefty wrote:

Just so I understand, did you discover problems (maybe preexisting race conditions) with my previously posted patch? If yes, please point them out, so it's easier to review yours; if not, I will assume your patch implements a better locking scheme and review it as such.


Sean,

I looked over your patch for a while.

Agreed, your patch fixes a race condition that my patch had exposed (I had analyzed the sequence of a wildcard destruct getting to a device listener before a racing device removal could, but not the reverse order).


I do have some issues though:

* In your patch, I suggest taking out the warning printk from cma_listen_on_dev() when the listener create attempt fails; it might be that the device is out of resources etc. Since the code takes care of this situation pretty well, I don't see a need for the printk.

* I don't see a reason for the internal_id and the device listeners getting a refcount on the wildcard listener. Even without these, the logic in rdma_destroy_id() guarantees that the wildcard listener will exist at least as long as any of its child device listeners are around. Can you provide the rationale for requiring them, then?

* Not that I am very worried (and I suggest resolving this through another, subsequent patch if it is really a problem), but I think device removal is still racy wrt non-wildcard listeners. Here's the sequence: cma_process_remove()->cma_remove_id_dev() decides it will rdma_destroy_id() the listener id, and at the same time a process context rdma_destroy_id() decides it is going to do the same. There are probably various ways to take care of this; the simplest one might be for rdma_destroy_id() to look at the "state" and make a decision about who gets to destroy.
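
A minimal user-space sketch of the "look at the state" idea in the last point, not the actual rdma_cm code. The names struct listen_id, claim_destroy(), destroy_id() and run_both_paths(), and the two-value state enum, are hypothetical. It assumes a single compare-and-swap on the state is enough to pick one winner; the two calls run back to back here only to keep the example deterministic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical states; the real rdma_cm state machine has more of them. */
enum id_state { ID_LISTEN, ID_DESTROYING };

struct listen_id {
    atomic_int state;     /* holds an enum id_state value */
    int teardown_runs;    /* written only by the caller that wins the claim */
};

/* Returns 1 only for the single caller that moves LISTEN -> DESTROYING. */
int claim_destroy(struct listen_id *id)
{
    int expected = ID_LISTEN;
    return atomic_compare_exchange_strong(&id->state, &expected, ID_DESTROYING);
}

/* Common entry point for the device-removal and process-context paths. */
int destroy_id(struct listen_id *id)
{
    if (!claim_destroy(id))
        return 0;               /* someone else owns the teardown */
    assert(id->teardown_runs == 0);
    id->teardown_runs++;        /* stands in for the real cleanup work */
    return 1;
}

/* Invoke both paths back to back and report how many teardowns ran. */
int run_both_paths(void)
{
    struct listen_id id = { .teardown_runs = 0 };

    atomic_init(&id.state, ID_LISTEN);
    destroy_id(&id);   /* e.g. the device-removal path getting there first */
    destroy_id(&id);   /* e.g. a process-context destroy arriving second */
    return id.teardown_runs;
}
```

Whichever caller loses the compare-and-swap simply returns, so the teardown body runs once regardless of arrival order.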


Thanks.

Kanoj

I tried to explain the issue somewhat in my change commit and code comments. The issue is synchronizing cleanup of the listen_list with device removal.
When an RDMA device is added to the system, a new listen request is added for all wildcard listens. Since the original locking held the mutex throughout the cleanup of the listen list, it prevented adding another listen request during that same time.
Similar protection was there for handling device removal. When a device is removed from the system, all internal listen requests associated with that device are destroyed. If the associated wildcard listen is also being destroyed, we need to ensure that we don't try to destroy the same listen twice.
My patch, like yours, ends up releasing the mutex while cleaning up the listen_list. I chose to eliminate the cma_destroy_listen() call, and use rdma_destroy_id() as a single destruction path instead. This keeps the locking contained to a single function. (I don't like acquiring a lock in one call and releasing it in another. It puts too much assumption on the caller.)
What was missing was ensuring that a device removal didn't try to destroy the same listen request. This is handled by adding the list_del*() calls to cma_cancel_listens(). Whichever thread removes the listening id from the device list is responsible for its destruction. And because that thread could be the device removal thread, I added a reference from the per device listen to the wildcard listen.
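
The ownership rule described above can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the patch itself: struct wildcard, struct dev_listen, listen_lock and every function name are hypothetical, a pthread mutex stands in for the kernel mutex, and a hand-rolled singly linked list stands in for the kernel list and its list_del*() calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct dev_listen;

struct wildcard {
    struct dev_listen *children; /* per-device listens, under listen_lock */
    int refcount;                /* 1 for the owner plus 1 per child */
    int freed;                   /* set when the last reference is dropped */
};

struct dev_listen {
    struct dev_listen *next;
    struct wildcard *parent;     /* kept valid by the child's reference */
    int destroyed;
};

pthread_mutex_t listen_lock = PTHREAD_MUTEX_INITIALIZER;

void add_child(struct wildcard *w, struct dev_listen *d)
{
    pthread_mutex_lock(&listen_lock);
    d->parent = w;
    d->next = w->children;
    w->children = d;
    w->refcount++;               /* child pins the wildcard listen */
    pthread_mutex_unlock(&listen_lock);
}

void put_wildcard(struct wildcard *w)
{
    pthread_mutex_lock(&listen_lock);
    int last = (--w->refcount == 0);
    pthread_mutex_unlock(&listen_lock);
    if (last)
        w->freed = 1;            /* stands in for freeing the id */
}

/* Runs with the lock released; only the thread that unlinked d gets here. */
void finish_destroy(struct dev_listen *d)
{
    assert(d->destroyed == 0);
    d->destroyed = 1;
    put_wildcard(d->parent);
}

/* Device-removal path: destroy d only if this call is the one that unlinks it. */
int remove_on_device_removal(struct dev_listen *d)
{
    int won = 0;

    pthread_mutex_lock(&listen_lock);
    for (struct dev_listen **pp = &d->parent->children; *pp; pp = &(*pp)->next) {
        if (*pp == d) {
            *pp = d->next;
            d->next = NULL;
            won = 1;
            break;
        }
    }
    pthread_mutex_unlock(&listen_lock);
    if (won)
        finish_destroy(d);
    return won;
}

/* Wildcard teardown: unlink one child under the lock, destroy it outside. */
void cancel_listens(struct wildcard *w)
{
    for (;;) {
        pthread_mutex_lock(&listen_lock);
        struct dev_listen *d = w->children;
        if (d) {
            w->children = d->next;
            d->next = NULL;
        }
        pthread_mutex_unlock(&listen_lock);
        if (!d)
            break;
        finish_destroy(d);
    }
    put_wildcard(w);             /* drop the owner's reference */
}

/* Returns 1 if each child was destroyed exactly once and the wildcard last. */
int demo_teardown(void)
{
    struct wildcard w = { .children = NULL, .refcount = 1, .freed = 0 };
    struct dev_listen a = { 0 }, b = { 0 };

    add_child(&w, &a);
    add_child(&w, &b);
    int first = remove_on_device_removal(&a);  /* unlinks and destroys a */
    int second = remove_on_device_removal(&a); /* a is already off the list */
    cancel_listens(&w);                        /* destroys b only */
    return first == 1 && second == 0 &&
           a.destroyed == 1 && b.destroyed == 1 && w.freed == 1;
}
```

In this sketch the lock is both taken and released inside each function, and the child's reference is what lets the device-removal path dereference d->parent without knowing whether the wildcard teardown has already started.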
Hopefully this makes sense.

- Sean


