On Jan 9, 2014, at 11:00 AM, Joshua Ladd <josh...@mellanox.com> wrote:
> Hcoll uses the PML as an "OOB" to bootstrap itself. When a communicator is
> destroyed, by the time we destroy the hcoll module, the communicator context
> is no longer valid and any pending operations that rely on its existence will
> fail. In particular, we have a non-blocking synchronization barrier that may
> be in progress when the communicator is destroyed.

Can you explain this a little more?

Do you mean you have a pending MPI_Ibarrier running on that communicator (i.e., the ibarrier has started but not completed)?

Or do you have some started-but-not-completed MPI_Isends/MPI_Irecvs (using the PML/coll equivalents of these, of course -- not the top-level MPI_* functions)?

Or are you saying that you need the destruction of the hcoll module on a given communicator to be synchronous between all processes in that communicator?

> Registering the delete callback allows us to finish these operations because
> the context is still valid inside of this callback. The commented out code is
> the "prototype" protocol that attempted to handle this scenario in an
> entirely different (and more complex) way. It is not needed now. We don't
> want to introduce solutions that are OMPI specific, because we need to be
> able to integrate hcoll into other runtimes. We considered approaching the
> community about changing the comm destroy flow in OMPI to keep the context
> alive long enough to complete our synchronization barriers, but then the
> solution is tied to a particular MPI

I'm not quite sure I understand -- the hcoll module (where this code is located) is completely OMPI-specific. I thought that libhcoll was your independent-of-MPI-implementations portion of this code...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
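For readers following the thread, here is a minimal sketch of the delete-callback pattern Josh describes, written against the plain MPI-3 attribute API rather than OMPI/hcoll internals (the names module_state and module_delete_fn are invented for illustration). The point it shows: inside a communicator attribute delete callback the communicator is still valid, so a non-blocking barrier that is still in flight can be completed there, before the context goes away.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical per-communicator state, standing in for what an
 * hcoll-like module might keep: a nonblocking barrier that may still
 * be in flight when the communicator is freed. */
struct module_state {
    MPI_Request pending_barrier;
};

/* Delete callback: invoked by MPI_Comm_free() while the communicator
 * is still valid (the property Josh relies on), so it is safe to
 * finish pending operations on it here. */
static int module_delete_fn(MPI_Comm comm, int keyval,
                            void *attribute_val, void *extra_state)
{
    struct module_state *st = (struct module_state *) attribute_val;
    if (MPI_REQUEST_NULL != st->pending_barrier) {
        MPI_Wait(&st->pending_barrier, MPI_STATUS_IGNORE);
    }
    free(st);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Comm dup;
    int keyval;
    struct module_state *st;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, module_delete_fn,
                           &keyval, NULL);

    st = malloc(sizeof(*st));
    st->pending_barrier = MPI_REQUEST_NULL;
    MPI_Comm_set_attr(dup, keyval, st);

    /* Start a nonblocking barrier and deliberately leave it pending. */
    MPI_Ibarrier(dup, &st->pending_barrier);

    /* The delete callback runs in here, while 'dup' is still usable,
     * and completes the barrier before the context disappears. */
    MPI_Comm_free(&dup);

    MPI_Comm_free_keyval(&keyval);
    MPI_Finalize();
    return 0;
}

(The real hcoll hook presumably lives inside OMPI's coll framework rather than the user-level attribute interface, but the lifetime question under discussion -- whether the callback fires while the communicator context is still usable -- is the same.)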