Hi Alan,

You mean start an interactive job with 'srun' and then run python via
another 'srun' within the interactive job?  If I do that:

  $ [login] srun --partition=gputest --qos=medium --ntasks=2 --cpus-per-task=1 \
      --time=07:00:00 --mem=2000 --gres=gpu:1 --mpi=openmpi --pty --x11 bash
  $ [gpu-node] module add TensorFlow
  $ [gpu-node] srun --mpi=openmpi python -c "import tensorflow as tf"

The second 'srun' hangs.  If I start the first 'srun' with '--mpi=none'
then the second 'srun' hangs too.
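
For reference, one variant I could still try, on the assumption that the
outer 'srun --pty' step is itself holding the allocated tasks and thereby
blocking the inner 'srun' from starting a step:

```shell
# Sketch (untested here): allocate with 'salloc' instead of a
# pseudo-terminal 'srun', so the 'srun' inside the allocation gets to
# create the first job step itself.
salloc --partition=gputest --qos=medium --ntasks=2 --cpus-per-task=1 \
       --time=07:00:00 --mem=2000 --gres=gpu:1
# then, inside the allocation:
module add TensorFlow
srun --mpi=openmpi python -c "import tensorflow as tf"
```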

Cheers,

Loris


Alan O'Cais <[email protected]> writes:

> Hi Loris, 
>
> From your mail it looks like you are starting an interactive session with 
> srun but when it comes to actually running python you are not using an MPI 
> runtime. What does this look like:
> srun --mpi=openmpi python -c "import tensorflow as tf"
>
> Alan
>
> On Fri, 28 Feb 2020 at 13:59, Loris Bennett <[email protected]> 
> wrote:
>
>  Åke Sandgren <[email protected]> writes:
>
>  > On 2/19/20 8:21 AM, Loris Bennett wrote:
>  >> OK, so you have the various PMIx versions installed both within *and*
>  >> separate from EB - that's the bit I was missing.  It is obviously a
>  >> slightly clunky solution, but, as you say, not much maintenance needed
>  >> once it has been set up.  I think this is what we are going to implement
>  >
>  > Yes, a bit of double work. But we can't rely on the EB installation for
>  > the Slurm setup. There are other use cases that go through Slurm, like
>  > the WLCG jobs that only rely on their singularity containers...
>  >
>  > And we also make sure that the /lap/(slurm|pmix|ucx|libevent) are
>  > copied onto each system's local disk so it will keep running even when
>  > those file servers are down.
>  >
>  > Btw, that /lap tree is available to look at (including our slurm build
>  > instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed, and if
>  > not there are a couple of web services that map AFS space into http,
>  > just google for it.
>
>  So, I have everything set up now,
>
>    - 3 versions of pmix built outside EasyBuild:
>
>      [build@admin ~]$ ll /trinity/shared/software/pmix/
>      total 0
>      drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
>      drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
>      drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4
>
>    - Slurm built against these versions:
>
>      rpmbuild -ta slurm-19.05.5.tar.bz2 --define \
>        "_with_pmix --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4"
>
>      [build@admin ~]$ srun --mpi=list
>      srun: MPI types are...
>      srun: pmix_v1
>      srun: none
>      srun: pmix
>      srun: pmix_v3
>      srun: openmpi
>      srun: pmi2
>      srun: pmix_v2
>
>    - various versions of pmix built within EasyBuild:
>
>      PMIx/1.2.5-GCCcore-6.4.0    PMIx/2.1.3-GCCcore-8.2.0    PMIx/3.1.1-GCCcore-8.2.0
>      PMIx/2.1.3-GCCcore-7.3.0    PMIx/2.1.3-GCCcore-8.3.0    PMIx/3.1.4-GCCcore-8.3.0 (D)
>
>      (I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3)
>
>    - PMIx dependency added to OpenMPI via hook (thanks, Åke):
>
>      [build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
>      ...
>      conflict("OpenMPI")
>      load("gcccuda/2019b")
>      load("zlib/1.2.11-GCCcore-8.3.0")
>      load("hwloc/1.11.12-GCCcore-8.3.0")
>      load("PMIx/2.1.3-GCCcore-8.3.0")
>
>    - TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild
>
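A minimal sketch of what such a parse hook might look like (assuming
EasyBuild's hooks mechanism; the PMIx version shown is illustrative and
would need to match the toolchain, and the exact shape of the dependency
entries is an assumption):

```python
# hooks.py -- minimal sketch of an EasyBuild parse hook that injects a
# PMIx dependency into every OpenMPI easyconfig at parse time.
# The PMIx version and the (name, version) tuple format are assumptions;
# adapt them to your site's easyconfigs.

def parse_hook(ec, *args, **kwargs):
    """Append PMIx to the dependencies of any OpenMPI easyconfig."""
    if ec.name == 'OpenMPI':
        # 'dependencies' is assumed to hold (name, version) entries here
        ec['dependencies'].append(('PMIx', '2.1.3'))
```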
>  I then start an interactive Slurm job in which I load the TensorFlow
>  module, start python and import tensorflow:
>
>    - srun --mpi=...
>    - module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>    - python
>    - import tensorflow as tf
>
>  However, I am still unable to get TensorFlow to work with anything other
>  than --mpi=openmpi, and even then I get the error about forking:
>
>      - srun --mpi=none    :: srun OK; on import TF: OMPI was not built
>        with SLURM's PMI support, MPI abort
>      - srun --mpi=openmpi :: srun OK; on import TF: OPAL ERROR - call to
>        fork()
>      - srun --mpi=pmi2    :: srun OK; OMPI was not built with SLURM's PMI
>        support, MPI abort
>      - srun --mpi=pmix_v1 :: srun hangs
>      - srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step
>        completely launched
>      - srun --mpi=pmix_v3 :: srun OK, import tensorflow hangs
>      - srun --mpi=pmix    :: srun OK, import tensorflow hangs
>
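One check that might narrow this down (my own suggestion, not verified on
this system): the "OMPI was not built with SLURM's PMI support" message
points at how the Open MPI underneath the TensorFlow module was configured,
which ompi_info can show:

```shell
# Sketch: with the TensorFlow module loaded, inspect the underlying
# Open MPI build for PMI/PMIx support.
module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
ompi_info | grep -i 'configure command'  # the original ./configure flags
ompi_info | grep -i pmix                 # which pmix MCA components exist
```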
>  I think this is pretty much the same situation as before I went through
>  all the PMIx hoops.
>
>  Can anyone see what I'm doing wrong?
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Mr.)
>  ZEDAT, Freie Universität Berlin         Email [email protected]
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
