Hi Gilles,

An answer regarding your comment below about PSM2 supporting adding devices at runtime:

PSM2 does not currently support adding/initializing more devices after the first initialization. Implementing this will require in-depth thought, since the set of initialized devices determines how endpoint IDs are created before connections are established. Changing that logic in the middle of a running communication would be a serious undertaking.
https://github.com/01org/opa-psm2/blob/master/psm_ep.c
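To make the ordering constraint concrete: PSM2 reads the device list from the environment when psm2_init() is first called, and nothing later can extend it. A minimal sketch, reusing only the psm2.h symbols that appear in the patch further down this thread:

    #include <stdlib.h>
    #include <psm2.h>

    static int init_with_all_devices(void)
    {
        int major = PSM2_VERNO_MAJOR;
        int minor = PSM2_VERNO_MINOR;

        /* The device list is consumed here, at first initialization;
         * PSM2 offers no call to add devices afterwards. */
        setenv("PSM2_DEVICES", "self,shm,hfi", 0 /* keep a user-set value */);

        return (PSM2_OK == psm2_init(&major, &minor)) ? 0 : -1;
    }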
Thanks,
_MAC

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Thursday, September 29, 2016 7:45 PM
To: Open MPI Developers <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] mtl/psm2 and $PSM2_DEVICES

Ralph,

What I had in mind is to ignore dynamically spawned orteds for the time being, and only consider the number of orteds spawned by mpirun at MPI_Init() time. The rationale is that "in the real world", mpirun is invoked via a batch manager (which does not resize jobs), and/or with a machinefile, or from a standalone laptop, so handling these cases and these cases only out of the box looks good enough to me.

/* when a notification method is implemented in Open MPI, it will have to notify PSM2 somehow (e.g. you started with "self,shm" only, now let's change that to "self,shm,hfi"). I have no clue whether PSM2 has already implemented it, or has it in its plans. */

Cheers,

Gilles

On Fri, Sep 30, 2016 at 10:54 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> No simple solution, I fear. I know Matias et al. are looking at the
> dynamic situation. Getting the number of orteds, especially if/when
> they can be dynamically spawned, requires use of the notification
> method - in the plans, but not yet implemented.
>
>
> On Sep 29, 2016, at 6:13 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> This is a follow-up of
> https://mail-archive.com/users@lists.open-mpi.org/msg30055.html
>
> Thanks Matias for the lengthy explanation.
>
> Currently, PSM2_DEVICES is overwritten, so I do not think setting it
> before invoking mpirun will help.
>
> Also, in this specific case:
> - the user is running within a SLURM allocation with 2 nodes
> - the user specified a host file with 2 distinct nodes
>
> My first impression is that mtl/psm2 could/should handle this properly
> (well, only one of the two conditions has to be met) and *not* set
> export PSM2_DEVICES="self,shm"
>
> The patch below
> - does not overwrite PSM2_DEVICES
> - does not set PSM2_DEVICES when num_max_procs > num_total_procs
> This is suboptimal, but I could not find a way to get the number of orteds.
> IIRC, MPI_Comm_spawn can have an orted dynamically spawned by passing
> a host in the MPI_Info. If this host is not part of the hostfile (nor
> the RM allocation?), then PSM2_DEVICES must be set manually by the user.
>
> Ralph,
>
> Is there a way to get the number of orteds?
> - if I mpirun -np 1 --host n0,n1 ... orte_process_info.num_nodes is 1
>   (I wish I could get 2)
> - if running in singleton mode, orte_process_info.num_max_procs is 0
>   (is this a bug or a feature?)
>
> Cheers,
>
> Gilles
>
> diff --git a/ompi/mca/mtl/psm2/mtl_psm2_component.c b/ompi/mca/mtl/psm2/mtl_psm2_component.c
> index 26bccd2..52b906b 100644
> --- a/ompi/mca/mtl/psm2/mtl_psm2_component.c
> +++ b/ompi/mca/mtl/psm2/mtl_psm2_component.c
> @@ -14,6 +14,8 @@
>   * Copyright (c) 2012-2015 Los Alamos National Security, LLC.
>   *                         All rights reserved.
>   * Copyright (c) 2013-2016 Intel, Inc. All rights reserved
> + * Copyright (c) 2016      Research Organization for Information Science
> + *                         and Technology (RIST). All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -170,6 +172,13 @@ get_num_total_procs(int *out_ntp)
>  }
>
>  static int
> +get_num_max_procs(int *out_nmp)
> +{
> +    *out_nmp = (int)ompi_process_info.max_procs;
> +    return OMPI_SUCCESS;
> +}
> +
> +static int
>  get_num_local_procs(int *out_nlp)
>  {
>      /* num_local_peers does not include us in
> @@ -201,7 +210,7 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
>      int verno_major = PSM2_VERNO_MAJOR;
>      int verno_minor = PSM2_VERNO_MINOR;
>      int local_rank = -1, num_local_procs = 0;
> -    int num_total_procs = 0;
> +    int num_total_procs = 0, num_max_procs = 0;
>
>      /* Compute the total number of processes on this host and our local rank
>       * on that node. We need to provide PSM2 with these values so it can
> @@ -221,6 +230,11 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
>                      "Cannot continue.\n");
>          return NULL;
>      }
> +    if (OMPI_SUCCESS != get_num_max_procs(&num_max_procs)) {
> +        opal_output(0, "Cannot determine max number of processes. "
> +                    "Cannot continue.\n");
> +        return NULL;
> +    }
>
>      err = psm2_error_register_handler(NULL /* no ep */,
>                                        PSM2_ERRHANDLER_NOP);
> @@ -230,8 +244,10 @@ ompi_mtl_psm2_component_init(bool enable_progress_threads,
>          return NULL;
>      }
>
> -    if (num_local_procs == num_total_procs) {
> -        setenv("PSM2_DEVICES", "self,shm", 0);
> +    if ((num_local_procs == num_total_procs) && (num_max_procs <= num_total_procs)) {
> +        if (NULL == getenv("PSM2_DEVICES")) {
> +            setenv("PSM2_DEVICES", "self,shm", 0);
> +        }
>      }
>
>      err = psm2_init(&verno_major, &verno_minor);
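To illustrate the dynamic case mentioned above, a minimal sketch of spawning onto an explicit host via the standard "host" MPI_Info key that Open MPI recognizes; the worker path and hostname are placeholders, not taken from the thread:

    #include <mpi.h>

    /* Sketch: ask Open MPI to start one worker on a specific node.
     * If that node is outside the hostfile/allocation, an orted is
     * spawned there dynamically - the case where PSM2_DEVICES has to
     * be set by hand. */
    static int spawn_on(const char *host)
    {
        MPI_Info info;
        MPI_Comm intercomm;
        int rc;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", host);  /* e.g. "icsnode38" */
        rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                            MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
        return rc;
    }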
>
> On 9/30/2016 12:38 AM, Cabral, Matias A wrote:
>
> Hi Gilles et al.,
>
> You are right, ptl.c is in the PSM2 code. As Ralph mentions, dynamic
> process support was/is not working in OMPI when using PSM2 because of
> an issue related to the transport keys. This was fixed in PR #1602
> (https://github.com/open-mpi/ompi/pull/1602) and should be included in
> v2.0.2. HOWEVER, this is not the error Juraj is seeing. The root of the
> assertion is that the PSM/PSM2 MTLs check where the "original" processes
> are running and, if they detect that all are local to the node, they will
> ONLY initialize the shared memory device (variable PSM2_DEVICES="self,shm").
> This is to avoid "reserving" HW resources in the HFI card that wouldn't
> be used unless you later spawn ranks on other nodes. Therefore, to allow
> dynamic processes to be spawned on other nodes you need to tell PSM2 to
> instruct the HW to initialize all the devices, by making the environment
> variable PSM2_DEVICES="self,shm,hfi" available before running the job.
>
> Note that while setting PSM2_DEVICES (*) will solve the assertion below,
> you will most likely still see the transport key issue if PR #1602 is
> not included.
>
> Thanks,
>
> _MAC
>
> (*)
> PSM2_DEVICES -> Omni-Path
> PSM_DEVICES  -> TrueScale
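For completeness, a sketch of one way to make the variable available from the application itself, assuming the component preserves a pre-set value (the component's setenv() above uses overwrite == 0, and the patch adds an explicit getenv() check); exporting it in the shell or batch script before mpirun achieves the same:

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Must happen before MPI_Init(), i.e. before mtl/psm2 reaches
         * psm2_init(): request all devices so ranks spawned later on
         * remote nodes can be reached over the HFI. */
        setenv("PSM2_DEVICES", "self,shm,hfi", 1);

        MPI_Init(&argc, &argv);
        /* ... MPI_Comm_spawn() to other nodes is now possible ... */
        MPI_Finalize();
        return 0;
    }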
>
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
> Sent: Thursday, September 29, 2016 7:12 AM
> To: Open MPI Users <us...@lists.open-mpi.org>
> Subject: Re: [OMPI users] MPI_Comm_spawn
>
> Ah, that may be why it wouldn't show up in the OMPI code base itself.
> If that is the case here, then no - OMPI v2.0.1 does not support
> comm_spawn for PSM. It is fixed in the upcoming 2.0.2.
>
>
> On Sep 29, 2016, at 6:58 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> My guess is that ptl.c comes from the PSM lib ...
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> Spawn definitely does not work with srun. I don't recognize the name
> of the file that segfaulted - what is "ptl.c"? Is that in your manager
> program?
>
>
> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Hi,
>
> I do not expect spawn to work with direct launch (e.g. srun).
>
> Do you have PSM (e.g. InfiniPath) hardware? That could be linked to
> the failure.
>
> Can you please try
>
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1
>
> and see if it helps?
>
> Note: if you have the possibility, I suggest you first try that
> without SLURM, and then within a SLURM job.
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, juraj2...@gmail.com <juraj2...@gmail.com> wrote:
>
> Hello,
>
> I am using MPI_Comm_spawn to dynamically create new processes from a
> single manager process. Everything works fine when all the processes
> are running on the same node, but imposing the restriction to run only
> a single process per node does not work. Below are the errors produced
> during a multinode interactive session and a multinode sbatch job.
>
> The system I am using is: Linux version 3.10.0-229.el7.x86_64
> (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat
> 4.8.2-16) (GCC))
> I am using Open MPI 2.0.1
> SLURM is version 15.08.9
>
> What is preventing my jobs from spawning on multiple nodes? Does SLURM
> require some additional configuration to allow it? Is it an issue on
> the MPI side; does it need to be compiled with some special flag (I
> have compiled it with --enable-mpi-fortran=all --with-pmi)?
>
> The code I am launching is here: https://github.com/goghino/dynamicMPI
>
> The manager tries to launch one new process (./manager 1). The error
> produced by requesting each process to be located on a different node
> (interactive session):
>
> $ salloc -N 2
> $ cat my_hosts
> icsnode37
> icsnode38
> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode37
> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
> [icsnode37:12614] *** Process received signal ***
> [icsnode37:12614] Signal: Aborted (6)
> [icsnode37:12614] Signal code: (-6)
> [icsnode38:32443] *** Process received signal ***
> [icsnode38:32443] Signal: Aborted (6)
> [icsnode38:32443] Signal code: (-6)
>
> The same example as above via sbatch job submission:
>
> $ cat job.sbatch
> #!/bin/bash
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
>
> module load openmpi/2.0.1
> srun -n 1 -N 1 ./manager 1
>
> $ cat output.o
> [manager]I'm running MPI 3.1
> [manager]Runing on node icsnode39
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
> [icsnode39:9692] *** reported by process [1007812608,0]
> [icsnode39:9692] *** on communicator MPI_COMM_SELF
> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [icsnode39:9692] ***    and potentially your MPI job)
> In: PMI_Abort(50, N/A)
> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT 2016-09-26T16:48:20 ***
> srun: error: icsnode39: task 0: Exited with exit code 50
>
> Thanks for any feedback!
>
> Best regards,
> Juraj
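One aside for debugging failures like the MPI_ERR_SPAWN abort above: MPI_Comm_spawn is called on MPI_COMM_SELF, whose default error handler is MPI_ERRORS_ARE_FATAL, so the job dies before the manager can report anything. A minimal sketch of making the spawn error recoverable; the worker path is a placeholder, not taken from the linked repo:

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: switch MPI_COMM_SELF to MPI_ERRORS_RETURN so a failed
     * spawn yields an error code instead of aborting the whole job. */
    static void try_spawn(void)
    {
        MPI_Comm intercomm;
        int rc;

        MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
        rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1,
                            MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &intercomm, MPI_ERRCODES_IGNORE);
        if (MPI_SUCCESS != rc) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "spawn failed: %s\n", msg);
        }
    }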