Hello, I found that SLURM installations that use the cgroup plugin and have TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all processes on the non-launch nodes are bound to a single core. This leads to quite poor performance. The problem shows up only when mpirun is used to start the parallel application from a batch script, for example: *mpirun ./mympi*. If srun with PMI is used instead, affinity is set properly: *srun ./mympi*.
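For illustration, a batch script along these lines reproduces the issue (the #SBATCH options and task counts below are only an example, nothing about them is special):

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=8

  # with mpirun, all ranks on the non-launch node end up bound to the same core
  mpirun ./mympi

  # alternative that sets affinity properly (launch via srun + PMI)
  #srun ./mympi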
A closer look shows that the reason lies in the way Open MPI uses srun to launch the ORTE daemons. Here is an example of the command line:

*srun --nodes=1 --ntasks=1* --kill-on-bad-exit --nodelist=node02 *orted* -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid

Saying *--nodes=1 --ntasks=1* to SLURM means that you want to start one task, and (with TaskAffinity=yes) it will be bound to one core. orted then uses this affinity as the base for all the child processes it spawns. If I understand correctly, the problem with using srun here is that saying *srun --nodes=1 --ntasks=4* would make SLURM spawn 4 independent orted processes bound to different cores, which is not what we really need.

I found that disabling CPU binding works well as a quick hack for the cgroup plugin. Since the job runs inside a cgroup that restricts which cores it may use, the spawned child processes are left to the node's scheduler and run on all allocated cores. The command line then looks like this:

srun *--cpu_bind=none* --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid

This solution probably won't work with the SLURM task/affinity plugin, and it may be a bad idea when strict affinity is desirable. A quick way to check the resulting binding is sketched after the patch below.

My patch against the stable Open MPI version (1.6.5) is attached to this e-mail. I will try to come up with a more reliable solution, but I need more time and would first like to hear the opinion of the Open MPI developers.

--
Best regards,
Artem Y. Polyakov
diff -Naur openmpi-1.6.5-old/orte/mca/plm/slurm/plm_slurm_module.c openmpi-1.6.5-new/orte/mca/plm/slurm/plm_slurm_module.c
--- openmpi-1.6.5-old/orte/mca/plm/slurm/plm_slurm_module.c	2012-04-03 10:30:29.000000000 -0400
+++ openmpi-1.6.5-new/orte/mca/plm/slurm/plm_slurm_module.c	2014-02-12 03:59:23.763664648 -0500
@@ -257,6 +257,8 @@
 
     /* add the srun command */
     opal_argv_append(&argc, &argv, "srun");
+
+    opal_argv_append(&argc, &argv, "--cpu_bind=none");
 
     /* Append user defined arguments to srun */
     if ( NULL != mca_plm_slurm_component.custom_args ) {
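For anybody who wants to try the workaround before a proper fix is ready: the patch should apply to an unpacked 1.6.5 source tree roughly as below (the patch file name and configure options are only an example), and the effect on the binding can be checked on a non-launch node while the job is running:

  # apply the patch and rebuild (the file name here is arbitrary)
  cd openmpi-1.6.5
  patch -p1 < plm-slurm-cpu-bind-none.patch
  ./configure --with-slurm && make && make install

  # run this on the non-launch node (node02 in the example above) while the job is active;
  # without the patch every rank shows the same single core in Cpus_allowed_list,
  # with --cpu_bind=none the list covers all cores that the cgroup allocated to the job
  for pid in $(pgrep -f mympi); do
      grep Cpus_allowed_list /proc/$pid/status
  done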