On 2/28/20 1:59 PM, Loris Bennett wrote:
> Åke Sandgren <[email protected]> writes:
>
>> On 2/19/20 8:21 AM, Loris Bennett wrote:
>>> OK, so you have the various PMIx versions installed both within *and*
>>> separate from EB - that's the bit I was missing. It is obviously a
>>> slightly clunky solution, but, as you say, not much maintenance needed
>>> once it has been set up. I think this is what we are going to implement
>>
>> Yes, a bit of double work. But we can't rely on the EB installation for
>> the Slurm setup. There are other use cases that go through Slurm, like
>> the WLCG jobs that only rely on their Singularity containers...
>>
>> And we also make sure that the /lap/(slurm|pmix|ucx|libevent) trees are
>> copied onto each system's local disk, so everything keeps running even
>> when those file servers are down.
>>
>> Btw, that /lap tree is available to look at (including our slurm build
>> instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed, and if
>> not there are a couple of web services that map AFS space into http,
>> just google for it.
>
> So, I have everything set up now,
>
> - 3 versions of pmix built outside EasyBuild:
>
> [build@admin ~]$ ll /trinity/shared/software/pmix/
> total 0
> drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
> drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
> drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4
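Each of those out-of-EasyBuild installs is a plain autotools build into a versioned prefix. A rough sketch for one version, using the 3.1.4 prefix from the listing above; the configure options shown are typical PMIx ones, not HPC2N's actual build instructions, so adjust for your site:

```shell
# Hypothetical sketch of one out-of-EasyBuild PMIx build.
# Prefix matches the directory listing above; configure flags are
# assumptions, check your site's own build recipe.
VERSION=3.1.4
PREFIX=/trinity/shared/software/pmix/${VERSION}

tar xf pmix-${VERSION}.tar.bz2
cd pmix-${VERSION}
./configure --prefix="${PREFIX}" --with-libevent
make -j"$(nproc)" install
```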
>
> - Slurm built against these versions:
>
> rpmbuild --define \
>   "_with_pmix --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4" \
>   -ta slurm-19.05.5.tar.bz2
>
> [build@admin ~]$ srun --mpi=list
> srun: MPI types are...
> srun: pmix_v1
> srun: none
> srun: pmix
> srun: pmix_v3
> srun: openmpi
> srun: pmi2
> srun: pmix_v2
>
> - various versions of PMIx built within EasyBuild:
>
> PMIx/1.2.5-GCCcore-6.4.0
> PMIx/2.1.3-GCCcore-7.3.0
> PMIx/2.1.3-GCCcore-8.2.0
> PMIx/2.1.3-GCCcore-8.3.0
> PMIx/3.1.1-GCCcore-8.2.0
> PMIx/3.1.4-GCCcore-8.3.0 (D)
>
> (I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3)
>
> - PMIx dependency added to OpenMPI via hook (thanks, Åke):
>
> [build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
> ...
> conflict("OpenMPI")
> load("gcccuda/2019b")
> load("zlib/1.2.11-GCCcore-8.3.0")
> load("hwloc/1.11.12-GCCcore-8.3.0")
> load("PMIx/2.1.3-GCCcore-8.3.0")
>
> - TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild
>
> I then start an interactive Slurm job in which I load the TensorFlow
> module, start python and import tensorflow:
>
> - srun --mpi=...
> - module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
> - python
> - import tensorflow as tf
>
> However, I am still unable to get TensorFlow to import with anything
> other than --mpi=openmpi, and even then I get the error about forking:
>
> - srun --mpi=none :: srun OK, on import TF: OMPI was not built with
> SLURM's PMI support, MPI abort
> - srun --mpi=openmpi :: srun OK, on import TF: OPAL ERROR - call to fork()
> - srun --mpi=pmi2 :: srun OK, OMPI was not built with SLURM's PMI
> support, MPI abort
> - srun --mpi=pmix_v1 :: srun hangs
> - srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step
> completely launched
> - srun --mpi=pmix_v3 :: srun OK, import tensorflow hangs
> - srun --mpi=pmix :: srun OK, import tensorflow hangs
>
> I think this is pretty much the same situation as before I went through
> all the PMIx hoops.
>
> Can anyone see what I'm doing wrong?
Each of our OpenMPI modules explicitly sets SLURM_MPI_TYPE to the correct
MPI type, in the above case pmix_v2.
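In the job environment, that module logic amounts to something like the following; pmix_v2 is just the example from this thread, so use whichever plugin name your OpenMPI was actually built against:

```shell
# What the OpenMPI modulefiles effectively do: pick the matching Slurm
# MPI plugin once, so a plain `srun ./app` behaves like
# `srun --mpi=pmix_v2 ./app`.
# The value must be one of the names printed by `srun --mpi=list`.
export SLURM_MPI_TYPE=pmix_v2
echo "$SLURM_MPI_TYPE"
```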
And this works (though it will, as expected, complain about fork):
salloc ... -n 1
ml fosscuda/2019a
ml TensorFlow/1.13.1-Python-3.7.2
srun --pty /bin/bash
python
import tensorflow as tf
No problems there.
This, however, fails with ENOTTY, as expected (see the srun man page for
--pty):
salloc -n 2
ml fosscuda/2019a
ml TensorFlow/1.13.1-Python-3.7.2
srun --pty /bin/bash
python
import tensorflow as tf
This, however, works:
salloc -n 2
ml fosscuda/2019a
ml TensorFlow/1.13.1-Python-3.7.2
srun /bin/bash -c 'python -c "import tensorflow as tf"; echo $?'
This returns a nice "0" from the echo $?.
So the question here is: exactly what were you doing?
It's much better to write an actual submit file and do everything in it
than to try to do interactive jobs, especially if they are MPI jobs that
want to talk to stdout/stderr...
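As a concrete illustration of that advice, a minimal batch script for the same test might look like this. The module names are taken from this thread; the task count and time limit are placeholders:

```shell
#!/bin/bash
#SBATCH -n 2
#SBATCH --time=00:10:00
# Sketch of a non-interactive version of the test above; assumes
# SLURM_MPI_TYPE is set by the OpenMPI module, as described earlier.
ml fosscuda/2019a
ml TensorFlow/1.13.1-Python-3.7.2
srun python -c 'import tensorflow as tf; print("import ok")'
```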
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se