Åke Sandgren <[email protected]> writes:

> On 2/19/20 8:21 AM, Loris Bennett wrote:
>> OK, so you have the various PMIx versions installed both within *and*
>> separate from EB - that's the bit I was missing.  It is obviously a
>> slightly clunky solution, but, as you say, not much maintenance needed
>> once it has been set up.  I think this is what we are going to implement
>
> Yes, a bit of double work. But we can't rely on the EB installation for
> the Slurm setup. There are other use cases that go through Slurm, like
> the WLCG jobs that only rely on their Singularity containers...
>
> And we also make sure that the /lap/(slurm|pmix|ucx|libevent) trees are
> copied onto each system's local disk so it will keep running even when
> those file servers are down.
>
> Btw, that /lap tree is available to look at (including our slurm build
> instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed, and if
> not there are a couple of web services that map AFS space into http,
> just google for it.

So, I have everything set up now:

  - 3 versions of pmix built outside EasyBuild:
      
    [build@admin ~]$ ll /trinity/shared/software/pmix/
    total 0
    drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
    drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
    drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4
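
    (For completeness, each of those trees is a plain autotools build along
    the following lines; the version and flags shown are illustrative rather
    than a record of exactly what I ran:)

    ```shell
    # Sketch of one standalone PMIx build; the prefix matches the listing
    # above, everything else (version, configure flags) is illustrative.
    VERSION=3.1.4
    tar xjf pmix-${VERSION}.tar.bz2
    cd pmix-${VERSION}
    ./configure --prefix=/trinity/shared/software/pmix/${VERSION}
    make -j8 && make install
    ```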

  - Slurm built against these versions:

    rpmbuild --define "_with_pmix --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4" \
        -ta slurm-19.05.5.tar.bz2

    [build@admin ~]$ srun --mpi=list
    srun: MPI types are...
    srun: pmix_v1
    srun: none
    srun: pmix
    srun: pmix_v3
    srun: openmpi
    srun: pmi2
    srun: pmix_v2
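
    (To double-check that build, I believe one can inspect which libpmix
    each Slurm MPI plugin actually links against; the plugin directory below
    is an assumption - `scontrol show config | grep -i plugindir` gives the
    real one:)

    ```shell
    # Verify each mpi_pmix plugin resolves to the intended libpmix
    # (/usr/lib64/slurm is an assumed PluginDir, adjust as needed)
    for p in /usr/lib64/slurm/mpi_pmix*.so; do
        echo "== $p"
        ldd "$p" | grep -i pmix
    done
    ```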
  
  - various versions of PMIx built within EasyBuild:

    PMIx/1.2.5-GCCcore-6.4.0    PMIx/2.1.3-GCCcore-8.2.0    PMIx/3.1.1-GCCcore-8.2.0
    PMIx/2.1.3-GCCcore-7.3.0    PMIx/2.1.3-GCCcore-8.3.0    PMIx/3.1.4-GCCcore-8.3.0 (D)

    (I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3)

  - PMIx dependency added to OpenMPI via hook (thanks, Åke):

    [build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
    ...
    conflict("OpenMPI")
    load("gcccuda/2019b")
    load("zlib/1.2.11-GCCcore-8.3.0")
    load("hwloc/1.11.12-GCCcore-8.3.0")
    load("PMIx/2.1.3-GCCcore-8.3.0")

  - TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild

I then start an interactive Slurm job in which I load the TensorFlow
module, start python and import tensorflow:

  - srun --mpi=...
  - module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
  - python
  - import tensorflow as tf
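
(As a sanity check independent of TensorFlow, a minimal MPI program in the
same job would show whether the PMIx wire-up itself works; mpi4py being
installed in this toolchain is an assumption on my part:)

    ```shell
    # Minimal check: does a trivial MPI job even launch under pmix?
    # (mpi4py availability in the loaded toolchain is an assumption)
    srun --mpi=pmix_v3 -n 2 \
        python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())'
    ```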

However, I am still unable to get TensorFlow working with anything other
than --mpi=openmpi, and even then I get the error about forking:

    - srun --mpi=none :: srun OK, on import TF: OMPI was not built with SLURM's PMI support, MPI abort
    - srun --mpi=openmpi :: srun OK, on import TF: OPAL ERROR - call to fork()
    - srun --mpi=pmi2 :: srun OK, OMPI was not built with SLURM's PMI support, MPI abort
    - srun --mpi=pmix_v1 :: srun hangs
    - srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step completely launched
    - srun --mpi=pmix_v3 :: srun OK, import tensorflow hangs
    - srun --mpi=pmix :: srun OK, import tensorflow hangs
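
(The "OMPI was not built with SLURM's PMI support" message makes me think
it is worth confirming what PMI/PMIx support the EasyBuild OpenMPI was
actually configured with, e.g.:)

    ```shell
    # Inspect the OpenMPI build's PMIx/Slurm-related components
    module add OpenMPI/3.1.4-gcccuda-2019b
    ompi_info | grep -iE 'pmix|slurm'
    ```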

I think this is pretty much the same situation as before I went through
all the PMIx hoops.

Can anyone see what I'm doing wrong?

Cheers,

Loris
    
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
