FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the same issue.
> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Folks,
>
> When running a simple helloworld program on OS X, we can end up with the
> following error message:
>
>     A system call failed during shared memory initialization that should
>     not have. It is likely that your MPI job will now either abort or
>     experience performance degradation.
>
>       Local host:  c7.kmc.kobe.rist.or.jp
>       System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>       Error:       No such file or directory (errno 2)
>
> The error does not occur on Linux by default, since the vader segment is
> placed in /dev/shm there. The patch below can be used to reproduce the
> issue on Linux:
>
> diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
> index 115bceb..80fec05 100644
> --- a/opal/mca/btl/vader/btl_vader_component.c
> +++ b/opal/mca/btl/vader/btl_vader_component.c
> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
>                                             OPAL_INFO_LVL_3,
>                                             MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>      OBJ_RELEASE(new_enum);
>
> -    if (0 == access ("/dev/shm", W_OK)) {
> +    if (0 && 0 == access ("/dev/shm", W_OK)) {
>          mca_btl_vader_component.backing_directory = "/dev/shm";
>      } else {
>          mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;
>
> From my analysis, here is what happens:
>
> - Each rank is supposed to have its own vader_segment unlinked by
>   btl/vader in vader_finalize().
>
> - But this file might have already been destroyed by another task in
>   orte_ess_base_app_finalize():
>
>       if (NULL == opal_pmix.register_cleanup) {
>           orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>       }
>
>   *All* the tasks end up removing the session directory via
>   opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1").
>
> I am not really sure about the best way to fix this.
>
> - One option is to perform an intra-node barrier in vader_finalize().
>
> - Another option would be to implement opal_pmix.register_cleanup.
>
> Any thoughts?
>
> Cheers,
>
> Gilles

--
Jeff Squyres
jsquy...@cisco.com
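
For illustration, here is a minimal standalone sketch of an ENOENT-tolerant unlink that would sidestep the race Gilles describes: if another local task has already torn down the session directory, the segment file is gone and unlink(2) fails with ENOENT, which can be treated as success instead of being reported as an error. The unlink_segment_if_present() helper and the hard-coded path are hypothetical stand-ins, not actual btl/vader code.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper: remove this rank's shared-memory backing file,
     * tolerating the case where another local task already destroyed the
     * whole session directory (the race described above). */
    static int unlink_segment_if_present(const char *segment_path)
    {
        if (0 == unlink(segment_path)) {
            return 0;               /* normal case: we removed it ourselves */
        }
        if (ENOENT == errno) {
            return 0;               /* already gone: benign race, stay quiet */
        }
        /* Anything else (EACCES, EPERM, ...) is a genuine failure. */
        fprintf(stderr, "unlink(%s) failed: %s\n",
                segment_path, strerror(errno));
        return -1;
    }

    int main(void)
    {
        /* Path modeled on the error message above; in a real run it would
         * come from the job session directory. */
        const char *path =
            "/tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54";
        return (0 == unlink_segment_if_present(path)) ? 0 : 1;
    }

Either of the proposed fixes would remove the race at its source rather than just tolerating it: an intra-node barrier in vader_finalize() would keep every local task from reaching orte_session_dir_finalize() until all segments had been unlinked, while implementing opal_pmix.register_cleanup would presumably delegate session-directory removal to a single agent after all local tasks have exited.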