We already have the register_cleanup option in master - are you using an older version of PMIx that doesn’t support it?
> On Oct 2, 2018, at 4:05 AM, Jeff Squyres (jsquyres) via devel <devel@lists.open-mpi.org> wrote:
>
> FYI: https://github.com/open-mpi/ompi/issues/5798 brought up what may be the same issue.
>
>> On Oct 2, 2018, at 3:16 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Folks,
>>
>> When running a simple helloworld program on OS X, we can end up with the following error message:
>>
>>     A system call failed during shared memory initialization that should
>>     not have. It is likely that your MPI job will now either abort or
>>     experience performance degradation.
>>
>>     Local host: c7.kmc.kobe.rist.or.jp
>>     System call: unlink(2) /tmp/ompi.c7.1000/pid.23376/1/vader_segment.c7.17d80001.54
>>     Error: No such file or directory (errno 2)
>>
>> The error does not occur on Linux by default, since the vader segment is in /dev/shm by default.
>>
>> The patch below can be used to evidence the issue on Linux:
>>
>> diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
>> index 115bceb..80fec05 100644
>> --- a/opal/mca/btl/vader/btl_vader_component.c
>> +++ b/opal/mca/btl/vader/btl_vader_component.c
>> @@ -204,7 +204,7 @@ static int mca_btl_vader_component_register (void)
>>                                            OPAL_INFO_LVL_3,
>>                                            MCA_BASE_VAR_SCOPE_GROUP, &mca_btl_vader_component.single_copy_mechanism);
>>      OBJ_RELEASE(new_enum);
>>
>> -    if (0 == access ("/dev/shm", W_OK)) {
>> +    if (0 && 0 == access ("/dev/shm", W_OK)) {
>>          mca_btl_vader_component.backing_directory = "/dev/shm";
>>      } else {
>>          mca_btl_vader_component.backing_directory = opal_process_info.job_session_dir;
>>
>> From my analysis, here is what happens:
>>
>> - Each rank is supposed to have its own vader_segment unlinked by btl/vader in vader_finalize().
>>
>> - But this file might have already been destroyed by another task in orte_ess_base_app_finalize():
>>
>>     if (NULL == opal_pmix.register_cleanup) {
>>         orte_session_dir_finalize(ORTE_PROC_MY_NAME);
>>     }
>>
>>   *all* the tasks end up removing opal_os_dirpath_destroy("/tmp/ompi.c7.1000/pid.23941/1")
>>
>> I am not really sure about the best way to fix this.
>>
>> - One option is to perform an intra-node barrier in vader_finalize().
>>
>> - Another option would be to implement an opal_pmix.register_cleanup.
>>
>> Any thoughts?
>>
>> Cheers,
>>
>> Gilles
>>
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
>
> --
> Jeff Squyres
> jsquy...@cisco.com