Makes sense to me - thanks!

On Jun 15, 2010, at 9:32 AM, Damien Guinier wrote:
> Using Intel OpenMP in conjunction with srun seems to cause a segmentation
> fault, at least in the 1.5 branch.
>
> After a long time tracking this strange bug, I finally found out that the
> slurmd ess component was corrupting the __environ structure. This results in
> a crash in Intel OpenMP, which calls getenv() after MPI_Finalize.
>
> In fact, during MPI_Init, the slurmd component calls putenv(), which creates
> a reference to a const string located in the mmap'ed text. At MPI_Finalize,
> the component is dlclose()d and unmapped, which leaves the __environ
> structure pointing to something that no longer exists.
>
> Since Intel OpenMP is looking for an environment variable that does not
> exist, it reads all variables in __environ and crashes.
>
> Here is a reproducer:
>
> /* launched by "srun --resv-port" */
> #include <mpi.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv) {
>     MPI_Init(&argc, &argv);
>     /* dlopens ess_slurmd.so */
>     /* ess_slurmd.so will call putenv() */
>     MPI_Finalize();
>     /* dlcloses ess_slurmd.so */
>     /* unmaps the reference, __environ is corrupted */
>     getenv("unknown_var");
>     /* will read all vars from __environ and crash */
>     return 0;
> }
>
> Attached is a patch to fix the bug. It calls unsetenv() at MPI_Finalize() to
> clean the environment.
>
> Thank you,
> Damien
>
>
> diff -r 9d999fdda967 -r 57de231642e2 orte/mca/ess/slurmd/ess_slurmd_module.c
> --- a/orte/mca/ess/slurmd/ess_slurmd_module.c  Fri Jun 04 15:29:28 2010 +0200
> +++ b/orte/mca/ess/slurmd/ess_slurmd_module.c  Tue Jun 15 11:45:02 2010 +0200
> @@ -387,7 +387,8 @@
>              ORTE_ERROR_LOG(ret);
>          }
>      }
> -
> +    unsetenv("OMPI_MCA_grpcomm");
> +    unsetenv("OMPI_MCA_routed");
>      /* deconstruct my nidmap and jobmap arrays - this
>       * function protects itself from being called
>       * before things were initialized
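
For reference, here is a minimal, MPI-free sketch of the same failure mode.
The mmap()/munmap() pair stands in for the dlopen()/dlclose() of
ess_slurmd.so, and the variable value is purely illustrative:

/* putenv() links the caller's string directly into __environ, so
 * tearing down the memory behind that string leaves a dangling entry. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* stand-in for the component's mmap'ed segment */
    char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(seg, "OMPI_MCA_grpcomm=example");  /* value is illustrative */
    putenv(seg);            /* __environ now points into the mapping */

    munmap(seg, 4096);      /* what unloading the component amounts to */

    getenv("unknown_var");  /* scans every __environ entry, including the
                               dangling one, and typically segfaults */
    return 0;
}

The unsetenv() calls in the patch remove those entries from __environ before
the component is unloaded, which is why the crash goes away.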