Hi Loris,

From your mail it looks like you are starting an interactive session with srun,
but when it comes to actually running python you are not using an MPI runtime.
What does this look like:

  srun --mpi=openmpi python -c "import tensorflow as tf"

Alan

On Fri, 28 Feb 2020 at 13:59, Loris Bennett <[email protected]> wrote:

Åke Sandgren <[email protected]> writes:

> On 2/19/20 8:21 AM, Loris Bennett wrote:
>> OK, so you have the various PMIx versions installed both within *and*
>> separate from EB - that's the bit I was missing. It is obviously a
>> slightly clunky solution, but, as you say, not much maintenance needed
>> once it has been set up. I think this is what we are going to implement
>
> Yes, a bit of double work. But we can't rely on the EB installation for
> the Slurm setup. There are other use cases that go through Slurm, like
> the WLCG jobs that only rely on their singularity containers...
>
> And we also make sure that the /lap/(slurm|pmix|uncx|libevent) are
> copied onto each system's local disk so it will keep running even when
> those file servers are down.
>
> Btw, that /lap tree is available to look at (including our slurm build
> instructions) at /afs/hpc2n.umu.se/lap if you have AFS installed, and if
> not there are a couple of web services that map AFS space into http,
> just google for it.

So, I have everything set up now,

- 3 versions of pmix built outside EasyBuild:

  [build@admin ~]$ ll /trinity/shared/software/pmix/
  total 0
  drwxr-xr-x 5 build staff 61 Feb 18 14:58 1.2.5
  drwxr-xr-x 6 build staff 76 Feb 18 15:03 2.2.3
  drwxr-xr-x 7 build staff 91 Feb 19 14:53 3.1.4

- Slurm built against these versions:

  rpmbuild --define "_with_pmix --with-pmix=/trinity/shared/software/pmix/1.2.5:/trinity/shared/software/pmix/2.2.3:/trinity/shared/software/pmix/3.1.4" -ta slurm-19.05.5.tar.bz2

  [build@admin ~]$ srun --mpi=list
  srun: MPI types are...
  srun: pmix_v1
  srun: none
  srun: pmix
  srun: pmix_v3
  srun: openmpi
  srun: pmi2
  srun: pmix_v2

- various versions of pmix built within EasyBuild:

  PMIx/1.2.5-GCCcore-6.4.0
  PMIx/2.1.3-GCCcore-8.2.0
  PMIx/3.1.1-GCCcore-8.2.0
  PMIx/2.1.3-GCCcore-7.3.0
  PMIx/2.1.3-GCCcore-8.3.0
  PMIx/3.1.4-GCCcore-8.3.0 (D)

  (I should possibly rebuild Slurm against 2.1.3 rather than 2.2.3)

- PMIx dependency added to OpenMPI via hook (thanks, Åke):

  [build@admin ~]$ module show OpenMPI/3.1.4-gcccuda-2019b
  ...
  conflict("OpenMPI")
  load("gcccuda/2019b")
  load("zlib/1.2.11-GCCcore-8.3.0")
  load("hwloc/1.11.12-GCCcore-8.3.0")
  load("PMIx/2.1.3-GCCcore-8.3.0")

- TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2 built via EasyBuild

I then start an interactive Slurm job in which I load the TensorFlow
module, start python and import tensorflow:

- srun --mpi=...
- module add TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
- python
- import tensorflow as tf

However, I am still unable to get TensorFlow to import with anything
other than --mpi=openmpi, and even then I get the error about forking:

- srun --mpi=none    :: srun OK, on import TF: OMPI was not built with SLURM's PMI support, MPI abort
- srun --mpi=openmpi :: srun OK, on import TF: OPAL ERROR - call to fork()
- srun --mpi=pmi2    :: srun OK, OMPI was not built with SLURM's PMI support, MPI abort
- srun --mpi=pmix_v1 :: srun hangs
- srun --mpi=pmix_v2 :: srun aborted: Job step aborted before step completely launched
- srun --mpi=pmix_v3 :: srun OK, import tensorflow hangs
- srun --mpi=pmix    :: srun OK, import tensorflow hangs

I think this is pretty much the same situation as before I went through
all the PMIx hoops.

Can anyone see what I'm doing wrong?

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email [email protected]

--
Dr. Alan O'Cais
E-CAM Software Manager
Juelich Supercomputing Centre
Forschungszentrum Juelich GmbH
52425 Juelich, Germany

Phone: +49 2461 61 5213
Fax: +49 2461 61 6656
E-mail: [email protected]
WWW: http://www.fz-juelich.de/ias/jsc/EN
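
For anyone wanting to repeat the per-plugin comparison from the quoted mail without an interactive session, a rough sketch follows. It runs the one-line import suggested at the top of this message once for each MPI type reported by `srun --mpi=list`. The helper names (`build_command`, `try_import`) and the 120-second timeout are illustrative assumptions, not part of Slurm or of the original setup, and the script assumes the TensorFlow module is already loaded in the environment it is started from.

```python
import shlex
import subprocess

# MPI plugin names as reported by `srun --mpi=list` in the quoted mail.
MPI_TYPES = ["none", "openmpi", "pmi2", "pmix_v1", "pmix_v2", "pmix_v3", "pmix"]

# The one-liner suggested in the reply.
PY_SNIPPET = "import tensorflow as tf"


def build_command(mpi_type, python="python"):
    """Return the srun argument list for one MPI plugin type."""
    return ["srun", f"--mpi={mpi_type}", python, "-c", PY_SNIPPET]


def try_import(mpi_type, timeout=120):
    """Run one import attempt and return a short status string.

    The timeout is an arbitrary assumption; it only exists so that the
    hanging cases do not block the whole loop.
    """
    cmd = build_command(mpi_type)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timed out (possible hang)"
    except FileNotFoundError:
        return "srun not found on this machine"
    if proc.returncode == 0:
        return "OK"
    # Keep only the last line of stderr as a one-line summary.
    lines = proc.stderr.strip().splitlines()
    return f"exit {proc.returncode}: {lines[-1] if lines else ''}"


if __name__ == "__main__":
    for mpi_type in MPI_TYPES:
        print(shlex.join(build_command(mpi_type)), "::", try_import(mpi_type))
```

Each output line has the same "command :: result" shape as the list in the quoted mail, which should make it easier to compare runs after rebuilding Slurm or OpenMPI against a different PMIx version.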

