Loris Bennett <[email protected]> writes: > Dear Andreas, > > I am not aware that we have ever built OpenMPI against a particular > version of Slurm before (before being in the last 10 years or so). > My very poor understanding is that both OpenMPI and Slurm support some > variants of PMI. As long as there is some overlap between the versions > supported, Slurm and OpenMPI should be able to work together. > > What we have done recently is to specify the PMI variants when building > Slurm and currently have this: > > $ srun --mpi=list > srun: MPI types are... > srun: pmix_v1 > srun: none > srun: pmix > srun: openmpi > srun: pmi2 > > As indicated here > > https://pmix.org/support/faq/which-environments-include-support-for-pmix/ > > recent versions of Slurm can support up to pmix_v3. Also on that page > it states the following > > Open MPI v3.x: PMIx v2.x > > The OpenMPI dependency for my TensorFlow module is OpenMPI/3.1.3, so > perhaps it's just that we need to build Slurm with support vorn PMIx v2.
With quite a lot of help from Åke I think I have worked out some of what
I need to do, namely:

1. Build each version of OpenMPI against the correct known working
   version of PMIx.

2. Build Slurm with support for multiple PMIx versions using something
   like

     rpmbuild --define "_with_pmix --with-pmix=/sw/PMIx/1.2.5-GCCcore-6.4.0:/sw/PMIx/2.2.3-GCCcore-7.3.0:/sw/PMIx/3.1.1-GCCcore-8.2.0:" -ta slurm-19.05.5.tar.bz2

What I am still not sure about is how to deal with toolchains which have
the same major version of OpenMPI, but obviously different versions of
the compiler. Thus, if I choose 2.2.3 as the version of PMIx for OpenMPI
3.x, then I have the following dependencies:

  foss-2018b <- OpenMPI-3.1.1-GCC-7.3.0-2.30   <- PMIx/2.2.3-GCCcore-7.3.0
  foss-2019a <- OpenMPI-3.1.3-GCC-8.2.0-2.31.1 <- PMIx/2.2.3-GCCcore-8.2.0

However, it is not clear to me what Slurm will do, since all the PMIx
versions provide

  libpmi.so
  libpmi2.so
  libpmix.so

It seems that the --with-pmix option is more about choosing between the
libpmi.so and libpmi2.so provided by the slurm-libpmi package and those
provided by PMIx. However, the different versions of OpenMPI seem to
require different versions of PMIx.

So I guess my question is: what should the argument for the --with-pmix
option be?

(Seems like that's more of a Slurm question rather than an EB one -
sorry about that.)

Cheers,

Loris

>
> "Henkel, Andreas" <[email protected]> writes:
>
>> Dear Loris,
>>
>> Isn't it the opposite? OpenMPI has to be built with Slurm properly?
>> We had several issues with OpenMPI when it was compiled with dlopen
>> and similar. I'd have to check the exact configuration flags.
>> Best
>> Andreas
>>
>>> On 29.01.2020 at 14:40, Loris Bennett <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Thinking about the problem: if it were a question of just rebuilding
>>> Slurm with a different version of OpenMPI, then presumably other
>>> MPI programs would have issues with Slurm, but we haven't seen this.
>>>
>>> So I am still mystified.
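For step 1 above, each OpenMPI build would then be configured against the PMIx installation of its own toolchain. A rough sketch for the two toolchains in the dependency list (the `--prefix` paths are hypothetical, and `--with-libevent=external` is usually required when an external PMIx is used):

```shell
# foss-2018b: OpenMPI 3.1.1 against PMIx 2.2.3 built with GCCcore-7.3.0
./configure --prefix=/sw/OpenMPI/3.1.1-GCC-7.3.0-2.30 \
            --with-slurm \
            --with-pmix=/sw/PMIx/2.2.3-GCCcore-7.3.0 \
            --with-libevent=external

# foss-2019a: OpenMPI 3.1.3 against PMIx 2.2.3 built with GCCcore-8.2.0
./configure --prefix=/sw/OpenMPI/3.1.3-GCC-8.2.0-2.31.1 \
            --with-slurm \
            --with-pmix=/sw/PMIx/2.2.3-GCCcore-8.2.0 \
            --with-libevent=external
```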
>>>
>>> Cheers,
>>>
>>> Loris
>>>
>>> Loris Bennett <[email protected]> writes:
>>>
>>>> Hi Kenneth,
>>>>
>>>> I have tried two different things:
>>>>
>>>> 1.
>>>>
>>>> Starting an interactive job as user 'loris' via Slurm on a GPU node,
>>>> loading the TensorFlow module, starting Python and then importing the
>>>> Python module 'tensorflow'. This triggers the original error below.
>>>>
>>>> 2.
>>>>
>>>> Logging directly into the same GPU node as above as 'root', loading
>>>> the TensorFlow module, starting Python and then importing the Python
>>>> module 'tensorflow'. This triggers the following warning:
>>>>
>>>>   A process has executed an operation involving a call to the
>>>>   "fork()" system call to create a child process. Open MPI is currently
>>>>   operating in a condition that could result in memory corruption or
>>>>   other system errors; your job may hang, crash, or produce silent
>>>>   data corruption. The use of fork() (or system() or other calls that
>>>>   create child processes) is strongly discouraged.
>>>>
>>>>   The process that invoked fork was:
>>>>
>>>>   Local host: [[17982,1],0] (PID 47441)
>>>>
>>>>   If you are *absolutely sure* that your application will successfully
>>>>   and correctly survive a call to fork(), you may disable this warning
>>>>   by setting the mpi_warn_on_fork MCA parameter to 0.
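As an aside: if fork() really is harmless in a given workload (TensorFlow spawns worker processes deliberately), the warning above can be silenced exactly as its last sentence suggests. A sketch:

```shell
# Disable Open MPI's fork() warning for jobs launched from this shell;
# only do this if you are sure fork() is safe here, as the warning says.
export OMPI_MCA_mpi_warn_on_fork=0

# Equivalent per-launch form ('./my_app' is a placeholder):
mpirun --mca mpi_warn_on_fork 0 ./my_app
```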
>>>>
>>>> I can then, however, successfully start a TensorFlow session:
>>>>
>>>>>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
>>>> 2020-01-28 15:32:57.120084: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with
>>>> properties:
>>>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>>> pciBusID: 0000:5e:00.0
>>>> totalMemory: 10.92GiB freeMemory: 10.75GiB
>>>> 2020-01-28 15:32:57.225946: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with
>>>> properties:
>>>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>>> pciBusID: 0000:d8:00.0
>>>> totalMemory: 10.92GiB freeMemory: 10.76GiB
>>>> 2020-01-28 15:32:57.226648: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu
>>>> devices: 0, 1
>>>> 2020-01-28 15:32:58.784466: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect
>>>> StreamExecutor with strength 1 edge matrix:
>>>> 2020-01-28 15:32:58.784506: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
>>>> 2020-01-28 15:32:58.784513: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
>>>> 2020-01-28 15:32:58.784517: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
>>>> 2020-01-28 15:32:58.784649: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow
>>>> device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) ->
>>>> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0,
>>>> compute capability: 6.1)
>>>> 2020-01-28 15:32:58.785041: I
>>>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow
>>>> device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) ->
>>>> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0,
>>>> compute capability: 6.1)
>>>> 2020-01-28 15:32:58.785979: I
>>>> tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool
>>>> with default inter op setting: 2. Tune using inter_op_parallelism_threads
>>>> for best performance.
>>>> Device mapping:
>>>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce
>>>> GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce
>>>> GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>>> 2020-01-28 15:32:58.786118: I
>>>> tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
>>>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce
>>>> GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce
>>>> GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>>>
>>>> So it looks as if the problem is a mismatch between the version of
>>>> OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
>>>> version loaded by our TensorFlow module (OpenMPI/3.1.3). None of the
>>>> values for the '--mpi' option for 'srun' makes any difference.
>>>>
>>>> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
>>>> limited understanding of PMI led me to believe that PMI can be used
>>>> instead of OpenMPI for starting MPI processes.
>>>>
>>>> Cheers,
>>>>
>>>> Loris
>>>>
>>>> Kenneth Hoste <[email protected]> writes:
>>>>
>>>>> Hi Loris,
>>>>>
>>>>> In which type of environment are you hitting this issue?
>>>>>
>>>>> Is this in a Slurm job environment? Is it an interactive environment?
>>>>>
>>>>> The problem is that loading the TensorFlow Python module triggers an
>>>>> MPI_Init, and Slurm doesn't like that for the reasons it mentions.
>>>>>
>>>>> We've been hitting this on our site too (but maybe only in my own
>>>>> personal account, not system-wide); I haven't gotten to the bottom
>>>>> of it yet...
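If the mismatch theory is right, then after rebuilding, the matching plugin still has to be requested at launch time. A sketch, assuming the rebuilt Slurm exposes a `pmix_v2` plugin (`./mpi_hello` is a placeholder):

```shell
# Ask srun explicitly for the PMIx v2 plugin for a single job
srun --mpi=pmix_v2 -N 2 -n 8 ./mpi_hello

# Or set a cluster-wide default in slurm.conf:
#   MpiDefault=pmix_v2
# and verify the available plugins with:
srun --mpi=list
```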
>>>>>
>>>>>
>>>>> regards,
>>>>>
>>>>> Kenneth
>>>>>
>>>>>> On 22/01/2020 14:22, Loris Bennett wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have built
>>>>>>
>>>>>>   TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>>>>
>>>>>> However, when I import the Python module I get the following error:
>>>>>>
>>>>>> $ module add TensorFlow
>>>>>> $ python
>>>>>> Python 3.7.2 (default, Jun 6 2019, 09:12:17)
>>>>>> [GCC 8.2.0] on linux
>>>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>>>>>> import tensorflow as tf
>>>>>> [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file
>>>>>> pmix2x_client.c at line 109
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> The application appears to have been direct launched using "srun",
>>>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>>>> execute. There are several options for building PMI support under
>>>>>> SLURM, depending upon the SLURM version you are using:
>>>>>>
>>>>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>>>>> requires that you configure and build SLURM --with-pmix.
>>>>>>
>>>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>>>> to the SLURM PMI library location.
>>>>>>
>>>>>> Please configure as appropriate and try again.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> *** An error occurred in MPI_Init_thread
>>>>>> *** on a NULL communicator
>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>> *** and potentially your MPI job)
>>>>>> [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT
>>>>>> completed completed successfully, but am not able to aggregate error
>>>>>> messages, and not able to guarantee that all other processes were killed!
>>>>>>
>>>>>> With a bit of googling I found this:
>>>>>>
>>>>>>   https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>>>>
>>>>>> Is this indeed an EB problem?
>>>>>>
>>>>>> Not having any understanding of TensorFlow, I don't know why just
>>>>>> loading the Python module causes a Slurm job to be launched.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Loris
>>>>>>
>>>>>
>>> --
>>> Dr. Loris Bennett (Mr.)
>>> ZEDAT, Freie Universität Berlin         Email [email protected]

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
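For completeness, the fallback path the error message in the thread describes (Slurm's own PMI-2 instead of PMIx) would look roughly as below; the `/usr` prefix is an assumption for a stock RPM install of slurm-libpmi:

```shell
# Build OpenMPI against Slurm's PMI-2 library (libpmi2.so from the
# slurm-libpmi package, typically under /usr/lib64 on CentOS)
./configure --with-slurm --with-pmi=/usr

# and launch with the matching plugin ('./my_app' is a placeholder):
#   srun --mpi=pmi2 ./my_app
```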

