Åke Sandgren <[email protected]> writes:
> On 2/19/20 8:21 AM, Loris Bennett wrote:
>> OK, so you have the various PMIx versions installed both within *and*
>> separate from EB - that's the bit I was missing. It is obviously a
>> slightly clunky solution, but, as you say, not much maintenance needed
>> once it has been set up. I think this is what we are going to implement
>
> Yes, a bit of double work. But we can't rely on the EB installation for
> the Slurm setup. There are other use cases that go through Slurm, likt
> the WLCG jobs that only rely on their singularity containers...
>
> And we also make sure that the /lap/(slurm|pmix|ucx|libevent) trees are
> copied onto each system's local disk, so it will keep running even when
> those file servers are down.
>
> Btw, that /lap tree is available to look at (including our slurm build
> instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed, and if
> not there are a couple of web services that map AFS space into http,
> just google for it.
So, I have everything set up now:
- 3 versions of pmix built outside EasyBuild:
[build@admin ~]$ ll /trinity/shared/software/pmix/
total 0
drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4
- Slurm built against these versions:
rpmbuild --define "_with_pmix --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4" -ta slurm-19.05.5.tar.bz2
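(As an extra sanity check that the build produced all three versioned plugins, you can list the installed plugin directory. The path below is an assumption based on the default RPM layout; adjust if your site relocates it:

```shell
# Each PMIx version Slurm was built against should show up as a
# versioned mpi_pmix plugin; /usr/lib64/slurm is the default
# plugindir for RPM-based installs -- adjust for your site.
ls -l /usr/lib64/slurm/mpi_pmix*.so

# Alternatively, ask the installed RPM which plugins it ships.
rpm -ql slurm | grep mpi_pmix
```
)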
[build@admin ~]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v1
srun: none
srun: pmix
srun: pmix_v3
srun: openmpi
srun: pmi2
srun: pmix_v2
- various versions of PMIx built within EasyBuild:
PMIx/1.2.5-GCCcore-6.4.0 PMIx/2.1.3-GCCcore-8.2.0
PMIx/3.1.1-GCCcore-8.2.0
PMIx/2.1.3-GCCcore-7.3.0 PMIx/2.1.3-GCCcore-8.3.0
PMIx/3.1.4-GCCcore-8.3.0 (D)
(I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3)
- PMIx dependency added to OpenMPI via hook (thanks, Åke):
[build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
...
conflict("OpenMPI")
load("gcccuda/2019b")
load("zlib/1.2.11-GCCcore-8.3.0")
load("hwloc/1.11.12-GCCcore-8.3.0")
load("PMIx/2.1.3-GCCcore-8.3.0")
- TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild
I then start an interactive Slurm job in which I load the TensorFlow
module, start python and import tensorflow:
- srun --mpi=...
- module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
- python
- import tensorflow as tf
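(For reference, the interactive steps above can be collapsed into a single non-interactive command per MPI flavour, which makes it easier to cycle through all the plugins; the module name is taken from above, while the task count and timeout are my own choices, and any partition/GPU options your site needs are omitted:

```shell
# Try each MPI plugin in turn; timeout guards against the hanging cases.
# One task is enough to reproduce the import failure.
for mpi in none openmpi pmi2 pmix_v1 pmix_v2 pmix_v3 pmix; do
    echo "=== --mpi=$mpi ==="
    timeout 60 srun --mpi="$mpi" -n 1 bash -lc '
        module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
        python -c "import tensorflow as tf; print(tf.__version__)"
    '
done
```
)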
However, I am still unable to get TensorFlow to work with anything other
than --mpi=openmpi, and even then I get the error about forking:
- srun --mpi=none :: srun OK, on import TF: OMPI was not built with SLURM's
PMI support, MPI abort
- srun --mpi=openmpi :: srun OK, on import TF: OPAL ERROR - call to fork()
- srun --mpi=pmi2 :: srun OK, OMPI was not built with SLURM's PMI support,
MPI abort
- srun --mpi=pmix_v1 :: srun hangs
- srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step
completely launched
- srun --mpi=pmix_v3 :: srun OK, import tensorflow hangs
- srun --mpi=pmix :: srun OK, import tensorflow hangs
I think this is pretty much the same situation as before I went through all
the PMIx hoops.
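One thing I still want to rule out (just a guess on my part): whether the OpenMPI underneath that fosscuda TensorFlow module was actually configured with Slurm/PMIx support, since the "OMPI was not built with SLURM's PMI support" message comes from OpenMPI itself. Something like this should show it, assuming fosscuda/2019a pulls in the matching OpenMPI module:

```shell
# Load the same toolchain the TensorFlow module uses, then inspect how
# its OpenMPI was configured. The configure line shows any
# --with-slurm/--with-pmix flags; the component list shows which pmix
# MCA components were actually built.
module add fosscuda/2019a
ompi_info | grep -i -e 'configure command' -e slurm -e pmix
```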
Can anyone see what I'm doing wrong?
Cheers,
Loris
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email [email protected]