Sean Hefty wrote:
Just so I understand, did you discover problems (maybe preexisting
race conditions) with my previously posted patch? If yes, please
point it out, so its easier to review yours; if not, I will assume
your patch implements a better locking scheme and review it as such.
Sean,
I looked over your patch for a while.
Agreed, your patch fixes a race condition that my patch had exposed (I
had analyzed the sequence wildcard destruct getting to a device listener
before a racing device removal could, but not the reverse order).
I do have some issues though:
* in your patch, I suggest taking out the warning printk from
cma_listen_on_dev() when the listener create attempt fails; it might be
that the device is out of resources etc. Since the code takes care of
this situation pretty well, I don't see a need for the printk.
* I don't see a reason for the internal_id and the device listeners
getting a refcount on the wildcard listener. Because, even without
these, it is guaranteed that the wildcard listener will exist at least
as long as any of the children device listener's are around, by looking
at the logic in rdma_destroy_id(). Can you provide some logic for
requring this then?
* not that I am very worried (and I suggesting resolving this thru
another subsequent patch if it is really a problem), but I think device
removal is still racy wrt non wildcard listeners. Here's the sequence:
cma_process_remove()->cma_remove_id_dev() decides it will
rdma_destroy_id() the listener id, and at the same time a process
context rdma_destroy_id() decides it is going to do the same. There are
probably various ways to take care of this, the simple one might be for
rdma_destroy_id() to look at the "state" and make a decision about who
gets to destroy.
Thanks.
Kanoj
I tried to explain the issue somewhat in my change commit and code
comments. The issue is synchronizing cleanup of the listen_list with
device removal.
When an RDMA device is added to the system, a new listen request is
added for all wildcard listens. Since the original locking held the
mutex throughout the cleanup of the listen list, it prevented adding
another listen request during that same time.
Similar protection was there for handling device removal. When a
device is removed from the system, all internal listen requests
associated with that device are destroyed. If the associated wildcard
listen is also being destroyed, we need to ensure that we don't try to
destroy the same listen twice.
My patch, like yours, ends up releasing the mutex while cleaning up
the listen_list. I choose to eliminate the cma_destroy_listen() call,
and use rdma_destroy_id() as a single destruction path instead. This
keeps the locking contained to a single function. (I don't like
acquiring a lock in one call and releasing it in another. It puts too
much assumption on the caller.)
What was missing was ensuring that a device removal didn't try to
destroy the same listen request. This is handled by the adding the
list_del*() calls to cma_cancel_listens(). Whichever thread removes
the listening id from the device list is responsible for its
destruction. And because that thread could be the device removal
thread, I added a reference from the per device listen to the wildcard
listen.
Hopefully this makes sense.
- Sean
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general