Durga,

There is an ongoing effort to make the master (and future versions) not
only thread safe for all networks (all combinations of BTL, and PML/MTL)
but also thread efficient. The fact that we removed the checks (such as the
one you noticed in the openib BTL) doesn't mean that all BTLs are currently
thread safe, but that we enabled them in multi-threaded builds and are
actively working to fix all issues we (and our nightly MTT tests) identify.

Overall, the BTLs should only care about protecting their internal state,
and little else (there is little else in the global context that the BTLs
are allowed to alter). All the components that the BTLs interact with, i.e.
PML, rcache and mpool, have their own protection. A few things you should
pay attention to:
1. If you have async progress, don't provide a progress function. This will
prevent the PML from calling progress when it might interfere with your
own. Also keep in mind that the PML has no idea whether it interferes with
your progress, and it is possible it will call sends indiscriminately.
2. Make sure the send path is thread-safe as the PML is allowed to call the
BTL API from multiple threads.
3. Don't raise any callbacks to the PML layer while holding a lock on the
endpoint, or in fact on any BTL structures (the PML can send a message in
the callback).
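
To make points 2 and 3 concrete, here is a minimal sketch in plain C with
pthreads. It is not Open MPI code: endpoint_t, frag_t, endpoint_send,
endpoint_progress and resend_cb are hypothetical names made up for the
illustration, and the "send" simply completes immediately. The idea is to
detach the list of completed fragments while holding the lock, release the
lock, and only then call up into the upper layer, so that a callback that
sends again does not deadlock on the endpoint lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types for illustration only; these are not Open MPI structures. */
typedef struct frag {
    struct frag *next;
    void (*cb)(struct frag *frag, void *ctx);  /* upper-layer completion callback */
    void *ctx;
} frag_t;

typedef struct endpoint {
    pthread_mutex_t lock;  /* protects 'completed' and 'sent' */
    frag_t *completed;     /* fragments whose completion is not yet reported */
    int sent;              /* number of fragments handed to endpoint_send */
} endpoint_t;

void endpoint_init(endpoint_t *ep)
{
    pthread_mutex_init(&ep->lock, NULL);
    ep->completed = NULL;
    ep->sent = 0;
}

/* Send path: may be called from any thread, including from inside a callback. */
void endpoint_send(endpoint_t *ep, frag_t *frag)
{
    assert(ep != NULL && frag != NULL);
    pthread_mutex_lock(&ep->lock);
    frag->next = ep->completed;  /* pretend the send completes immediately */
    ep->completed = frag;
    ep->sent++;
    pthread_mutex_unlock(&ep->lock);
}

/* Progress: detach the completed list under the lock, then run the callbacks
 * with the lock released. Returns the number of fragments processed. */
int endpoint_progress(endpoint_t *ep)
{
    pthread_mutex_lock(&ep->lock);
    frag_t *list = ep->completed;
    ep->completed = NULL;
    pthread_mutex_unlock(&ep->lock);

    int n = 0;
    while (list != NULL) {
        frag_t *next = list->next;  /* read before the callback can reuse the frag */
        if (list->cb != NULL) {
            list->cb(list, list->ctx);  /* may call endpoint_send() again */
        }
        list = next;
        n++;
    }
    return n;
}

/* Example callback: the upper layer reacts to a completion by sending once more. */
typedef struct resend_ctx {
    endpoint_t *ep;
    frag_t *followup;
} resend_ctx_t;

void resend_cb(frag_t *frag, void *ctx)
{
    (void)frag;
    resend_ctx_t *rc = ctx;
    if (rc->followup != NULL) {
        frag_t *f = rc->followup;
        rc->followup = NULL;
        endpoint_send(rc->ep, f);
    }
}
```

Had endpoint_progress called the callback before unlocking, resend_cb would
have tried to take the same (non-recursive) mutex in endpoint_send, which
with a default pthread mutex typically hangs the calling thread.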

I am sure the list of "don'ts" is much longer. Feel free to ask if you
find anything puzzling.

  George.





On Sun, Mar 6, 2016 at 3:31 PM, dpchoudh . <dpcho...@gmail.com> wrote:

> Hello all
>
> sorry for asking too many 101 questions; hopefully someone won't mind
> answering.
>
> It looks like, as of the current release, some BTLs (e.g. openib) are not
> thread safe, and the code explicitly bails out if it finds that MPI_Init()
> was called with THREAD_MULTIPLE. Then there are some BTLs, such as TCP,
> that can handle THREAD_MULTIPLE. Here are the questions:
>
> 1. There must be global (shared) variables that the BTL layer is
> accessing, which is giving rise to the thread-safety problem. Is there a
> list of such variables, the code paths in which they are accessed, and/or
> any documentation on them (including any past mailing list post)?
>
> 2. Browsing through the mailing list (I have been a subscriber to the
> *user* list for quite a while), it looks like a lot of people have stumbled
> onto the issue that the openib BTL is not thread safe. Given that, I'd
> presume, it is the most popular BTL, since infiniband-like fabrics hold a
> lion's share of the HPC interconnect market, it must be quite difficult to
> make it thread safe. Any comments on the level of work it would take to
> make sure a new BTL would be thread safe? Something along the lines of a
> 'do-this' or 'don't-do-that' would be greatly appreciated.
>
> 3. It looks like the openib BTL bailing out if called with THREAD_MULTIPLE
> has been removed in the master branch (at least from a cursory look). Does
> that mean that the openib BTL is now thread safe, or is it that the check
> has simply been moved to another location?
>
> Thanks in advance
> Durga
>
> Life is complex. It has real and imaginary parts.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18691.php
>
