Using Intel OpenMP in conjunction with srun seems to cause a
segmentation fault, at least in the 1.5 branch.
After a long time spent tracking down this strange bug, I finally found
out that the slurmd ess component corrupts the __environ array. This
results in a crash in the Intel OpenMP runtime, which calls getenv()
after MPI_Finalize.
In fact, during MPI_Init, the slurmd component calls putenv() with a
const string located in the component's mmap'ed image. At MPI_Finalize,
the component is dlclose()d and its image is munmap()ed, which leaves
__environ pointing at memory that no longer exists.
Since Intel OpenMP then looks up an environment variable that does not
exist, getenv() scans every entry in __environ and crashes on the stale
pointer.
Here is a reproducer:
/* launched by "srun --resv-ports" */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* dlopens ess_slurmd.so, which calls putenv() */
    MPI_Finalize();
    /* dlcloses ess_slurmd.so, unmapping the string putenv()
     * registered: __environ is now corrupted */
    getenv("unknown_var");
    /* reads every entry in __environ and crashes */
    return 0;
}
Attached is a patch that fixes the bug: it calls unsetenv() at
MPI_Finalize(), so the stale entries are removed from the environment
before the component's image is unmapped.
Thank you,
Damien
diff -r 9d999fdda967 -r 57de231642e2 orte/mca/ess/slurmd/ess_slurmd_module.c
--- a/orte/mca/ess/slurmd/ess_slurmd_module.c Fri Jun 04 15:29:28 2010 +0200
+++ b/orte/mca/ess/slurmd/ess_slurmd_module.c Tue Jun 15 11:45:02 2010 +0200
@@ -387,7 +387,8 @@
ORTE_ERROR_LOG(ret);
}
}
-
+ unsetenv("OMPI_MCA_grpcomm");
+ unsetenv("OMPI_MCA_routed");
/* deconstruct my nidmap and jobmap arrays - this
* function protects itself from being called
* before things were initialized