Makes sense to me - thanks!

On Jun 15, 2010, at 9:32 AM, Damien Guinier wrote:

> Using Intel OpenMP in conjunction with srun seems to cause a segmentation 
> fault, at least in the 1.5 branch.
> 
> After a long time tracking this strange bug, I finally found out that the 
> slurmd ess component was corrupting the __environ structure. This results in 
> a crash in Intel OpenMP, which calls getenv() after MPI_Finalize.
> 
> In fact, during MPI_Init, the slurmd component calls putenv(), which stores a 
> pointer to a const string located in the component's mmap'ed text segment. At 
> MPI_Finalize, dlclose() unmaps the component, leaving the __environ array 
> pointing at memory that no longer exists.
> 
> Since Intel OpenMP looks up an environment variable that does not exist, 
> getenv() walks every entry in __environ, dereferences the dangling pointer, 
> and crashes.
> 
> Here is a reproducer:
> 
> /* launched by "srun --resv-port" */
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv) {
>     MPI_Init(&argc, &argv);
>             /* dlopens ess_slurmd.so, which calls putenv() */
>     MPI_Finalize();
>             /* dlcloses ess_slurmd.so; the putenv'd string is
>                unmapped and __environ is left dangling */
>     getenv("unknown_var");
>             /* walks every entry of __environ and crashes */
>     return 0;
> }
> 
> Attached is a patch to fix the bug: it calls unsetenv() at MPI_Finalize() to 
> remove the dangling entries from the environment.
> 
> Thank you,
> Damien
> 
> 
> diff -r 9d999fdda967 -r 57de231642e2 orte/mca/ess/slurmd/ess_slurmd_module.c
> --- a/orte/mca/ess/slurmd/ess_slurmd_module.c Fri Jun 04 15:29:28 2010 +0200
> +++ b/orte/mca/ess/slurmd/ess_slurmd_module.c Tue Jun 15 11:45:02 2010 +0200
> @@ -387,7 +387,8 @@
>             ORTE_ERROR_LOG(ret);
>         }
>     }
> -    
> +    unsetenv("OMPI_MCA_grpcomm");
> +    unsetenv("OMPI_MCA_routed");
>     /* deconstruct my nidmap and jobmap arrays - this
>      * function protects itself from being called
>      * before things were initialized
> 

