On Jan 9, 2014, at 11:00 AM, Joshua Ladd <josh...@mellanox.com> wrote:

> Hcoll uses the PML as an "OOB" to bootstrap itself. When a communicator is 
> destroyed, by the time we destroy the hcoll module, the communicator context 
> is no longer valid and any pending operations that rely on its existence will 
> fail. In particular, we have a non-blocking synchronization barrier that may 
> be in progress when the communicator is destroyed.

Can you explain this a little more?  Do you mean you have a pending 
MPI_Ibarrier running on that communicator?  (i.e., the ibarrier has started but 
not completed)  Or do you have some started-but-not-completed 
MPI_Isends/MPI_Irecvs?

(using the PML/coll equivalents of these, of course -- not the top-level MPI_* 
functions)
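
For concreteness, here is roughly the first scenario I'm asking about, sketched 
with the top-level MPI_* calls (the hcoll case would go through the PML/coll 
interfaces instead; the specifics below are just my guess at the pattern):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm dup;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);

        /* Start a non-blocking barrier on the communicator... */
        MPI_Ibarrier(dup, &req);

        /* ...and tear the communicator down while that ibarrier is
           still "started but not completed".  At the MPI level the
           pending operation is allowed to complete; the question is
           what hcoll needs internally at this point. */
        MPI_Comm_free(&dup);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }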

Or are you saying that you need the destruction of the hcoll module on a given 
communicator to be synchronous between all processes in that communicator?

> Registering the delete callback allows us to finish these operations because 
> the context is still valid inside this callback. The commented-out code is 
> the "prototype" protocol that attempted to handle this scenario in an 
> entirely different (and more complex) way. It is not needed now. We don't 
> want to introduce solutions that are OMPI-specific, because we need to be 
> able to integrate hcoll into other runtimes. We considered approaching the 
> community about changing the comm destroy flow in OMPI to keep the context 
> alive long enough to complete our synchronization barriers, but then the 
> solution is tied to a particular MPI
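
If I'm following the first part correctly, the pattern sounds like the one MPI 
exposes to users via attribute delete callbacks: the delete function runs 
during communicator teardown while the handle is still valid, so pending 
operations can be completed there.  A rough sketch of that pattern -- the 
names and the use of a keyval are my own illustration; hcoll presumably hooks 
OMPI's internal destruction path directly:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical per-communicator state with an outstanding
       non-blocking synchronization. */
    struct module_state {
        MPI_Request pending_barrier;
    };

    /* Runs during MPI_Comm_free(), while 'comm' is still a valid
       handle, so the pending barrier can be finished here. */
    static int module_delete_fn(MPI_Comm comm, int keyval,
                                void *attr_val, void *extra_state)
    {
        struct module_state *s = (struct module_state *) attr_val;
        MPI_Wait(&s->pending_barrier, MPI_STATUS_IGNORE);
        free(s);
        return MPI_SUCCESS;
    }

    /* Hypothetical setup: start the non-blocking barrier and hang the
       state off the communicator so the delete callback can find it. */
    void module_attach(MPI_Comm comm, int *keyval)
    {
        struct module_state *s = malloc(sizeof(*s));

        MPI_Ibarrier(comm, &s->pending_barrier);
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, module_delete_fn,
                               keyval, NULL);
        MPI_Comm_set_attr(comm, *keyval, s);
        /* A later MPI_Comm_free() on 'comm' invokes module_delete_fn
           before the communicator is actually destroyed. */
    }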

I'm not quite sure I understand -- the hcoll module (where this code is 
located) is completely OMPI-specific.  I thought that libhcoll was the portion 
of this code that is independent of any particular MPI implementation...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
