Hi Alan,

You mean start an interactive job with 'srun' and then run python via
another 'srun' within the interactive job?  If I do that:
  $ [login] srun --partition=gputest --qos=medium --ntasks=2 --cpus-per-task=1 \
      --time=07:00:00 --mem=2000 --gres=gpu:1 --mpi=openmpi --pty --x11 bash
  $ [gpu-node] module add TensorFlow
  $ [gpu-node] srun --mpi=openmpi python -c "import tensorflow as tf"

the second 'srun' hangs.  If I start the first 'srun' with '--mpi=none',
the second 'srun' hangs too.

Cheers,

Loris

Alan O'Cais <[email protected]> writes:

> Hi Loris,
>
> From your mail it looks like you are starting an interactive session with
> srun, but when it comes to actually running python you are not using an MPI
> runtime.  What does this look like:
>
>   srun --mpi=openmpi python -c "import tensorflow as tf"
>
> Alan
>
> On Fri, 28 Feb 2020 at 13:59, Loris Bennett <[email protected]>
> wrote:
>
> > Åke Sandgren <[email protected]> writes:
> >
> > > On 2/19/20 8:21 AM, Loris Bennett wrote:
> > >> OK, so you have the various PMIx versions installed both within *and*
> > >> separate from EB - that's the bit I was missing.  It is obviously a
> > >> slightly clunky solution, but, as you say, not much maintenance is
> > >> needed once it has been set up.  I think this is what we are going
> > >> to implement.
> > >
> > > Yes, a bit of double work.  But we can't rely on the EB installation
> > > for the Slurm setup.  There are other use cases that go through Slurm,
> > > like the WLCG jobs that only rely on their Singularity containers...
> > >
> > > And we also make sure that /lap/(slurm|pmix|ucx|libevent) is copied
> > > onto each system's local disk, so it will keep running even when
> > > those file servers are down.
> > >
> > > Btw, that /lap tree is available to look at (including our Slurm build
> > > instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed; if
> > > not, there are a couple of web services that map AFS space into HTTP -
> > > just google for it.
> > So, I have everything set up now:
> >
> > - 3 versions of PMIx built outside EasyBuild:
> >
> >   [build@admin ~]$ ll /trinity/shared/software/pmix/
> >   total 0
> >   drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
> >   drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
> >   drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4
> >
> > - Slurm built against these versions:
> >
> >   rpmbuild --define "_with_pmix \
> >     --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4" \
> >     -ta slurm-19.05.5.tar.bz2
> >
> >   [build@admin ~]$ srun --mpi=list
> >   srun: MPI types are...
> >   srun: pmix_v1
> >   srun: none
> >   srun: pmix
> >   srun: pmix_v3
> >   srun: openmpi
> >   srun: pmi2
> >   srun: pmix_v2
> >
> > - various versions of PMIx built within EasyBuild:
> >
> >   PMIx/1.2.5-GCCcore-6.4.0    PMIx/2.1.3-GCCcore-8.2.0    PMIx/3.1.1-GCCcore-8.2.0
> >   PMIx/2.1.3-GCCcore-7.3.0    PMIx/2.1.3-GCCcore-8.3.0    PMIx/3.1.4-GCCcore-8.3.0 (D)
> >
> >   (I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3.)
> >
> > - PMIx dependency added to OpenMPI via a hook (thanks, Åke):
> >
> >   [build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
> >   ...
> >   conflict("OpenMPI")
> >   load("gcccuda/2019b")
> >   load("zlib/1.2.11-GCCcore-8.3.0")
> >   load("hwloc/1.11.12-GCCcore-8.3.0")
> >   load("PMIx/2.1.3-GCCcore-8.3.0")
> >
> > - TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild
> >
> > I then start an interactive Slurm job in which I load the TensorFlow
> > module, start python, and import tensorflow:
> >
> >   - srun --mpi=...
> >   - module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
> >   - python
> >   - import tensorflow as tf
> >
> > However, I am still unable to get TensorFlow to work with anything other
> > than --mpi=openmpi, and even then I get the error about forking:
> >
> >   - srun --mpi=none    :: srun OK; on import of TF: OMPI was not built
> >     with SLURM's PMI support, MPI abort
> >   - srun --mpi=openmpi :: srun OK; on import of TF: OPAL ERROR - call
> >     to fork()
> >   - srun --mpi=pmi2    :: srun OK; OMPI was not built with SLURM's PMI
> >     support, MPI abort
> >   - srun --mpi=pmix_v1 :: srun hangs
> >   - srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step
> >     completely launched
> >   - srun --mpi=pmix_v3 :: srun OK; import tensorflow hangs
> >   - srun --mpi=pmix    :: srun OK; import tensorflow hangs
> >
> > I think this is pretty much the same situation as before I went through
> > all the PMIx hoops.
> >
> > Can anyone see what I'm doing wrong?
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Mr.)
> > ZEDAT, Freie Universität Berlin       Email [email protected]

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin       Email [email protected]
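Aside: with this many plugins in play, it can help to check programmatically
that everything Slurm was built against really shows up in 'srun --mpi=list'.
A minimal parsing sketch (the function name is illustrative; the sample is
the exact output quoted in this thread):

```python
def mpi_plugins(srun_mpi_list_output):
    """Parse 'srun --mpi=list' output into a set of plugin names.

    Expects lines of the form 'srun: <plugin>'; the header line
    'srun: MPI types are...' is skipped.
    """
    plugins = set()
    for line in srun_mpi_list_output.splitlines():
        line = line.strip()
        if line.startswith("srun:") and "MPI types" not in line:
            plugins.add(line.split(":", 1)[1].strip())
    return plugins


# Sample taken verbatim from the 'srun --mpi=list' output above:
sample = """\
srun: MPI types are...
srun: pmix_v1
srun: none
srun: pmix
srun: pmix_v3
srun: openmpi
srun: pmi2
srun: pmix_v2
"""

# All three pmix generations should be present after the rebuild:
assert {"pmix_v1", "pmix_v2", "pmix_v3"} <= mpi_plugins(sample)
```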
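When a step hangs at launch, another quick check from inside the step is
which process-management variables srun actually exported: the pmix plugins
set PMIX_*, pmi2 sets PMI_*, and OpenMPI's own launcher adds OMPI_*. A
stdlib-only sketch using those standard prefixes; the sample environment at
the bottom is made up purely for illustration:

```python
import os


def pm_env(environ=None):
    """Return process-management-related variables from the environment.

    Run inside a job step, this shows which interface srun wired up:
    PMIX_* for the pmix plugins, PMI_* for pmi2, OMPI_* from OpenMPI.
    """
    environ = os.environ if environ is None else environ
    prefixes = ("PMI_", "PMIX_", "OMPI_", "SLURM_STEP")
    return {k: v for k, v in environ.items() if k.startswith(prefixes)}


# Illustrative (made-up) environment, roughly what --mpi=pmix might set:
fake = {"PMIX_RANK": "0", "SLURM_STEP_ID": "0", "HOME": "/home/loris"}
print(pm_env(fake))  # HOME is filtered out; the PM variables remain
```

An empty result inside a step that was launched with one of the pmix
plugins would suggest the plugin never got as far as wiring up the
environment, which would fit the hangs described above.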

