Just so I understand, did you discover problems (maybe preexisting race conditions) with my previously posted patch? If yes, please point them out, so it's easier to review yours; if not, I will assume your patch implements a better locking scheme and review it as such.
I tried to explain the issue somewhat in my change commit and code comments. The issue is synchronizing cleanup of the listen_list with device removal.
When an RDMA device is added to the system, a new listen request is added for all wildcard listens. Since the original locking held the mutex throughout the cleanup of the listen list, it prevented adding another listen request during that same time.
Similar protection was there for handling device removal. When a device is removed from the system, all internal listen requests associated with that device are destroyed. If the associated wildcard listen is also being destroyed, we need to ensure that we don't try to destroy the same listen twice.
My patch, like yours, ends up releasing the mutex while cleaning up the listen_list. I chose to eliminate the cma_destroy_listen() call and use rdma_destroy_id() as a single destruction path instead. This keeps the locking contained to a single function. (I don't like acquiring a lock in one call and releasing it in another. It assumes too much of the caller.)
What was missing was ensuring that a device removal didn't try to destroy the same listen request. This is handled by adding the list_del*() calls to cma_cancel_listens(). Whichever thread removes the listening id from the device list is responsible for its destruction. And because that thread could be the device removal thread, I added a reference from the per-device listen to the wildcard listen.
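To illustrate the ownership rule, here is a minimal userspace sketch, not the actual cma.c code. All names in it (struct dev_listen, struct wildcard_listen, claim_listen(), and so on) are made up for the example, a pthread mutex stands in for the kernel mutex, and the small list helpers only mimic what list_del_init() does. The idea it shows: a thread unlinks the entry under the lock, and only the thread that actually performed the unlink goes on to destroy it, with the lock already released. The per-device entry also holds a reference on its wildcard parent so the parent cannot go away underneath it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Tiny circular list, mimicking the kernel's list_head for this sketch. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n) { n->prev = n->next = n; }
static bool node_unlinked(const struct node *n) { return n->next == n; }

static void node_add(struct node *n, struct node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink and reinitialize, so a later check sees the node as unlinked. */
static void node_del_init(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

/* Hypothetical stand-ins for the wildcard and per-device listens. */
struct wildcard_listen { int refcount; bool freed; };
struct dev_listen {
    struct node link;               /* entry in a device's listen list */
    struct wildcard_listen *parent; /* holds one reference on parent */
    int destroy_count;              /* how often teardown actually ran */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void put_wildcard(struct wildcard_listen *w)
{
    bool last;

    pthread_mutex_lock(&lock);
    assert(w->refcount > 0);
    last = (--w->refcount == 0);
    pthread_mutex_unlock(&lock);
    if (last)
        w->freed = true;            /* real code would free the id here */
}

static void attach_listen(struct dev_listen *l, struct wildcard_listen *w,
                          struct node *dev_list)
{
    l->parent = w;
    l->destroy_count = 0;
    node_init(&l->link);
    pthread_mutex_lock(&lock);
    w->refcount++;                  /* reference from per-device listen */
    node_add(&l->link, dev_list);
    pthread_mutex_unlock(&lock);
}

/* Returns true only for the caller that removed the entry from the list. */
static bool claim_listen(struct dev_listen *l)
{
    bool claimed = false;

    pthread_mutex_lock(&lock);
    if (!node_unlinked(&l->link)) {
        node_del_init(&l->link);
        claimed = true;
    }
    pthread_mutex_unlock(&lock);
    return claimed;
}

/* Single destruction path; safe to call from either teardown thread. */
static void destroy_listen(struct dev_listen *l)
{
    if (!claim_listen(l))
        return;                     /* someone else owns the teardown */
    l->destroy_count++;             /* teardown runs without the lock held */
    put_wildcard(l->parent);
}
```

With this shape, the wildcard teardown and the device removal can both call destroy_listen() on the same entry, and the second caller simply returns.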
Hopefully this makes sense.
- Sean
