Using Intel OpenMP in conjunction with srun seems to cause a segmentation fault, at least in the 1.5 branch.

After a long time spent tracking this strange bug, I finally found out that the slurmd ess component corrupts the __environ structure. This results in a crash in Intel OpenMP, which calls getenv() after MPI_Finalize.

During MPI_Init, the slurmd component calls putenv() with a const string located in the component's mmap'ed text segment; putenv() stores that pointer directly in __environ, without making a copy. At MPI_Finalize, the component is dlclose'd and its mapping is munmap'ed, so the __environ entry now points to memory that no longer exists.
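
To illustrate the mechanism outside of Open MPI, here is a minimal standalone sketch of my own (not the actual component code), where free() stands in for the munmap() done by dlclose(). Reading the freed entry is undefined behavior, so it may only fault under valgrind or a hardened allocator, but the dangling __environ entry is exactly the same:

#include <stdlib.h>
#include <string.h>

int main(void) {
    /* stands in for the const string in the component's mmap'ed text */
    char *entry = strdup("SOME_VAR=some_value");
    putenv(entry);          /* __environ now holds this exact pointer, no copy is made */
    free(entry);            /* stands in for dlclose()/munmap() of the component */
    getenv("unknown_var");  /* walks every entry, including the dangling one */
    return 0;
}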

Since Intel OpenMP looks up an environment variable that does not exist, getenv() scans every entry in __environ, dereferences the dangling pointer, and crashes.
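
This is easy to see from how getenv() is typically implemented. The following simplified sketch (my own illustration, not glibc's actual code) shows why a lookup for a missing variable has to read through every entry, including the dangling one:

#include <stddef.h>
#include <string.h>

extern char **environ;

static char *my_getenv(const char *name) {
    size_t len = strlen(name);
    for (char **ep = environ; *ep != NULL; ep++) {
        /* strncmp() reads the pointed-to string, so one dangling
           entry is enough to fault the whole scan */
        if (strncmp(*ep, name, len) == 0 && (*ep)[len] == '=')
            return *ep + len + 1;
    }
    return NULL;  /* a missing variable means every entry was visited */
}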

Here is a reproducer:

/* launched by "srun --resv-ports" */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
            /* dlopens ess_slurmd.so */
            /* ess_slurmd.so will call putenv() */
    MPI_Finalize();
            /* dlcloses ess_slurmd.so */
            /* unmaps the referenced string, __environ is corrupted */
    getenv("unknown_var");
            /* will read all vars from __environ and crash */
    return 0;
}
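
Assuming the code above is saved as reproducer.c (the file name is arbitrary), it can be built and launched like this:

mpicc reproducer.c -o reproducer
srun --resv-ports -n 1 ./reproducer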

Attached is a patch that fixes the bug. It calls unsetenv() at MPI_Finalize() to remove the dangling entries from the environment.
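
For what it's worth, another way to avoid the dangling pointer entirely would be for the component to use setenv() instead of putenv(), since setenv() copies its arguments into libc-owned storage that survives the dlclose(). A rough sketch of the idea (the helper name and the values are placeholders, not the component's actual parameters):

#include <stdlib.h>

/* hypothetical stand-in for the component's init code */
static void set_mca_params(void)
{
    /* putenv("OMPI_MCA_grpcomm=...") would store the component's own
       pointer in __environ; setenv() makes a private copy instead */
    setenv("OMPI_MCA_grpcomm", "some_value", 1);
    setenv("OMPI_MCA_routed", "some_value", 1);
}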

Thank you,
Damien


diff -r 9d999fdda967 -r 57de231642e2 orte/mca/ess/slurmd/ess_slurmd_module.c
--- a/orte/mca/ess/slurmd/ess_slurmd_module.c	Fri Jun 04 15:29:28 2010 +0200
+++ b/orte/mca/ess/slurmd/ess_slurmd_module.c	Tue Jun 15 11:45:02 2010 +0200
@@ -387,7 +387,8 @@
             ORTE_ERROR_LOG(ret);
         }
     }
-    
+    unsetenv("OMPI_MCA_grpcomm");
+    unsetenv("OMPI_MCA_routed");
     /* deconstruct my nidmap and jobmap arrays - this
      * function protects itself from being called
      * before things were initialized
