Durga,

There is an ongoing effort to make master (and future versions) not only thread safe for all networks (all combinations of BTL and PML/MTL) but also thread efficient. The fact that we removed the checks (such as the one you noticed in the openib BTL) doesn't mean that all BTLs are currently thread safe, only that we have enabled them in multi-threaded builds and are actively working to fix every issue that we (and our nightly MTT tests) identify.
Overall, the BTLs should only care about protecting their internal state and little else (there is little else in the global context that the BTLs are allowed to alter). All the components that the BTLs interact with, i.e. PML, rcache and mpool, have their own protection.

A few things you should pay attention to:

1. If you have async progress, don't provide a progress function. This will prevent the PML from calling progress when it might interfere with your own. Also keep in mind that the PML has no idea whether it interferes with your progress, so it may call the send functions at any time.

2. Make sure the send path is thread safe, as the PML is allowed to call the BTL API from multiple threads.

3. Don't raise any callbacks to the PML layer while holding a lock on the endpoint, or in fact on any BTL structures (the PML can send a message from within the callback).

I am sure the list of "don'ts" is much longer. Feel free to ask if you find anything puzzling.

George.

On Sun, Mar 6, 2016 at 3:31 PM, dpchoudh . <dpcho...@gmail.com> wrote:

> Hello all
>
> Sorry for asking too many 101 questions; hopefully someone won't mind
> answering.
>
> It looks like, as of the current release, some BTLs (e.g. openib) are not
> thread safe, and the code explicitly bails out if it finds that
> MPI_Init_thread() was called with THREAD_MULTIPLE. Then there are some
> BTLs, such as TCP, that can handle THREAD_MULTIPLE. Here are the questions:
>
> 1. There must be global (shared) variables that the BTL layer is
> accessing, which is giving rise to the thread-safety problems. Is there a
> list of such variables, the code paths in which they are accessed, and/or
> any documentation on them (including any past mailing list posts)?
>
> 2. Browsing through the mailing list (I have been a subscriber to the
> *user* list for quite a while), it looks like a lot of people have stumbled
> onto the issue that the openib BTL is not thread safe.
> Given that, I'd presume, it is the most popular BTL, since InfiniBand-like
> fabrics hold a lion's share of the HPC interconnect market, it must be
> quite difficult to make it thread safe. Any comments on the level of work
> it would take to make sure a new BTL would be thread safe? Something along
> the lines of a 'do-this' or 'don't-do-that' would be greatly appreciated.
>
> 3. It looks like the openib BTL bailing out if called with THREAD_MULTIPLE
> has been removed in the master branch (at least from a cursory look). Does
> that mean that the openib BTL is now thread safe, or is it that the check
> has simply been moved to another location?
>
> Thanks in advance
> Durga
>
> Life is complex. It has real and imaginary parts.
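
As an illustration of point 3 in the reply (no callbacks to the upper layer while a BTL lock is held), here is a minimal sketch in C with pthreads. It is not Open MPI code: the names endpoint_t, frag_t, ep_send, ep_progress and sketch_demo are hypothetical and exist only for this example, and the "send" is assumed to complete immediately. The idea is that the progress routine detaches the list of completed fragments while holding the endpoint lock, releases the lock, and only then invokes the callbacks, so a callback that posts another send does not deadlock on the endpoint.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types for illustration only; this is NOT the Open MPI BTL API. */
typedef struct frag frag_t;
typedef struct endpoint endpoint_t;
typedef void (*completion_cb_t)(endpoint_t *ep, frag_t *frag);

struct frag {
    frag_t *next;
    completion_cb_t cb;     /* upper-layer completion callback */
};

struct endpoint {
    pthread_mutex_t lock;   /* protects only 'completed' and 'posted' */
    frag_t *completed;      /* fragments whose completion is not yet reported */
    int posted;             /* number of sends posted so far */
};

/* Send path: may be called from any thread, including from inside a callback.
 * Assumption for the demo: the send completes as soon as it is posted. */
void ep_send(endpoint_t *ep, frag_t *frag)
{
    pthread_mutex_lock(&ep->lock);
    ep->posted++;
    frag->next = ep->completed;
    ep->completed = frag;
    pthread_mutex_unlock(&ep->lock);
}

/* Progress: detach the completed list under the lock, then run the
 * callbacks after the lock has been released. Returns callbacks run. */
int ep_progress(endpoint_t *ep)
{
    int delivered = 0;

    pthread_mutex_lock(&ep->lock);
    frag_t *list = ep->completed;
    ep->completed = NULL;
    pthread_mutex_unlock(&ep->lock);

    while (list != NULL) {
        frag_t *frag = list;
        list = frag->next;
        frag->next = NULL;
        frag->cb(ep, frag);  /* no lock held: the callback may call ep_send() */
        delivered++;
    }
    return delivered;
}

static frag_t follow_up;
static int callbacks_seen;

static void on_complete_done(endpoint_t *ep, frag_t *frag)
{
    (void)ep;
    (void)frag;
    callbacks_seen++;
}

/* A callback that sends again. If ep_progress() still held ep->lock here,
 * ep_send() could deadlock on the default (non-recursive) mutex. */
static void on_complete_chain(endpoint_t *ep, frag_t *frag)
{
    (void)frag;
    callbacks_seen++;
    follow_up.cb = on_complete_done;
    ep_send(ep, &follow_up);
}

/* Drives the sketch: one send whose callback posts a second send. */
int sketch_demo(void)
{
    endpoint_t ep = { .completed = NULL, .posted = 0 };
    frag_t first = { .next = NULL, .cb = on_complete_chain };

    pthread_mutex_init(&ep.lock, NULL);
    callbacks_seen = 0;

    ep_send(&ep, &first);
    int n = ep_progress(&ep);   /* reports 'first'; its callback queues follow_up */
    n += ep_progress(&ep);      /* reports 'follow_up' */

    pthread_mutex_destroy(&ep.lock);
    assert(n == callbacks_seen);
    return n;
}
```

In this sketch the lock covers only the queue manipulation, which also matches the advice that a BTL should protect its own internal state and little else. A real BTL would have more state to protect and a real completion mechanism; the pattern shown is just one way to avoid calling up while locked.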