Makes sense to me - thanks!
On Jun 15, 2010, at 9:32 AM, Damien Guinier wrote:
> Using Intel OpenMP in conjunction with srun seems to cause a segmentation
> fault, at least in the 1.5 branch.
>
> After a long time tracking this strange bug, I finally found out that the
> slurmd ess component was corrupting the __environ structure. This results in
> a crash in Intel OpenMP, which calls getenv() after MPI_Finalize.
>
> In fact, during MPI_Init the slurmd component calls putenv() with a const
> string located in the component's mmap'ed text segment, so __environ ends up
> referencing that memory. At MPI_Finalize the component is dlclose()d and
> unmapped, which leaves __environ pointing at memory that no longer exists.
>
> Since Intel OpenMP looks for an environment variable that does not exist,
> getenv() scans every entry in __environ and crashes on the dangling one.
>
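> Stripped of MPI, the mechanism is roughly the following sketch, where
> libputs.so stands in for any shared object whose constructor calls putenv()
> with a string stored in its own image:
>
> #include <dlfcn.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void) {
>     /* hypothetical library: its constructor does putenv("SOME_VAR=1")
>      * with a string that lives inside the library image */
>     void *h = dlopen("./libputs.so", RTLD_NOW);
>     if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
>
>     /* __environ now holds a pointer into the mmap'ed library */
>     dlclose(h);             /* unmaps the library; that pointer now dangles */
>
>     getenv("unknown_var");  /* scans every __environ entry and may segfault */
>     return 0;
> }
>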
> Here is a reproducer:
>
> /* launched by "srun --resv-ports" */
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv) {
>     MPI_Init(&argc, &argv);    /* dlopens ess_slurmd.so, which calls putenv() */
>     MPI_Finalize();            /* dlcloses ess_slurmd.so, unmapping the string
>                                 * referenced by __environ */
>     getenv("unknown_var");     /* scans every __environ entry and crashes */
>     return 0;
> }
>
> Attached is a patch to fix the bug. It calls unsetenv() at MPI_Finalize() to
> remove the affected variables from the environment before the component is
> unmapped.
>
> Thank you,
> Damien
>
>
> diff -r 9d999fdda967 -r 57de231642e2 orte/mca/ess/slurmd/ess_slurmd_module.c
> --- a/orte/mca/ess/slurmd/ess_slurmd_module.c Fri Jun 04 15:29:28 2010 +0200
> +++ b/orte/mca/ess/slurmd/ess_slurmd_module.c Tue Jun 15 11:45:02 2010 +0200
> @@ -387,7 +387,8 @@
> ORTE_ERROR_LOG(ret);
> }
> }
> -
> + unsetenv("OMPI_MCA_grpcomm");
> + unsetenv("OMPI_MCA_routed");
> /* deconstruct my nidmap and jobmap arrays - this
> * function protects itself from being called
> * before things were initialized
>