Dear Andreas,

I am not aware that we have ever built OpenMPI against a particular
version of Slurm before (where "before" means in the last 10 years or
so). My very poor understanding is that both OpenMPI and Slurm support
some variants of PMI, and that, as long as there is some overlap
between the supported versions, Slurm and OpenMPI should be able to
work together.
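If it helps, that overlap can be checked from both sides. A rough
sketch (the module name is just the one from our toolchain, and these
commands obviously only run on the cluster itself):

```shell
# Slurm side: which PMI plugins this Slurm installation offers
srun --mpi=list

# Open MPI side: which PMI/PMIx support this Open MPI build
# was actually compiled with (module name is site-specific)
module load OpenMPI/3.1.3
ompi_info --parsable | grep -i pmix
```

If the two lists share a version (e.g. Slurm's pmix_v2 and an Open MPI
built against PMIx v2.x), direct launch with the matching
'srun --mpi=...' value should in principle work.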
What we have done recently is to specify the PMI variants when
building Slurm, and we currently have this:

$ srun --mpi=list
srun: MPI types are...
srun: pmix_v1
srun: none
srun: pmix
srun: openmpi
srun: pmi2

As indicated here

  https://pmix.org/support/faq/which-environments-include-support-for-pmix/

recent versions of Slurm can support up to PMIx v3. That page also
states the following:

  Open MPI v3.x: PMIx v2.x

The OpenMPI dependency for my TensorFlow module is OpenMPI/3.1.3, so
perhaps we just need to build Slurm with support for PMIx v2.

Cheers,

Loris

"Henkel, Andreas" <[email protected]> writes:

> Dear Loris,
>
> Isn't it the opposite? OpenMPI has to be built against Slurm properly?
> We had several issues with OpenMPI when it was compiled with dlopen
> and similar. I'd have to check the exact configuration flags.
> Best
> Andreas
>
>> On 29.01.2020 at 14:40, Loris Bennett <[email protected]> wrote:
>>
>> Hi,
>>
>> Thinking about the problem, if it were a question of just rebuilding
>> Slurm with a different version of OpenMPI, then presumably other
>> MPI programs would have issues with Slurm, but we haven't seen this.
>>
>> So I am still mystified.
>>
>> Cheers,
>>
>> Loris
>>
>> Loris Bennett <[email protected]> writes:
>>
>>> Hi Kenneth,
>>>
>>> I have tried two different things:
>>>
>>> 1.
>>>
>>> Starting an interactive job as user 'loris' via Slurm on a GPU node,
>>> loading the TensorFlow module, starting Python and then importing
>>> the Python module 'tensorflow'. This triggers the original error
>>> below.
>>>
>>> 2.
>>>
>>> Logging directly into the same GPU node as above as 'root', loading
>>> the TensorFlow module, starting Python and then importing the Python
>>> module 'tensorflow'. This triggers the following warning:
>>>
>>> A process has executed an operation involving a call to the
>>> "fork()" system call to create a child process.
>>> Open MPI is currently operating in a condition that could result
>>> in memory corruption or other system errors; your job may hang,
>>> crash, or produce silent data corruption. The use of fork() (or
>>> system() or other calls that create child processes) is strongly
>>> discouraged.
>>>
>>> The process that invoked fork was:
>>>
>>> Local host: [[17982,1],0] (PID 47441)
>>>
>>> If you are *absolutely sure* that your application will successfully
>>> and correctly survive a call to fork(), you may disable this warning
>>> by setting the mpi_warn_on_fork MCA parameter to 0.
>>>
>>> I can then, however, successfully start a TensorFlow session:
>>>
>>>>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
>>> 2020-01-28 15:32:57.120084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
>>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>> pciBusID: 0000:5e:00.0
>>> totalMemory: 10.92GiB freeMemory: 10.75GiB
>>> 2020-01-28 15:32:57.225946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
>>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>> pciBusID: 0000:d8:00.0
>>> totalMemory: 10.92GiB freeMemory: 10.76GiB
>>> 2020-01-28 15:32:57.226648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
>>> 2020-01-28 15:32:58.784466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
>>> 2020-01-28 15:32:58.784506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
>>> 2020-01-28 15:32:58.784513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
>>> 2020-01-28 15:32:58.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
>>> 2020-01-28 15:32:58.784649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device
>>> (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1)
>>> 2020-01-28 15:32:58.785041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1)
>>> 2020-01-28 15:32:58.785979: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
>>> Device mapping:
>>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>> 2020-01-28 15:32:58.786118: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
>>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>>
>>> So it looks as if the problem is a mismatch between the version of
>>> OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
>>> version loaded by our TensorFlow module (OpenMPI/3.1.3). None of the
>>> values for the '--mpi' option of 'srun' makes any difference.
>>>
>>> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
>>> limited understanding of PMI led me to believe that PMI can be used
>>> instead of OpenMPI for starting MPI processes.
>>>
>>> Cheers,
>>>
>>> Loris
>>>
>>> Kenneth Hoste <[email protected]> writes:
>>>
>>>> Hi Loris,
>>>>
>>>> In which type of environment are you hitting this issue?
>>>>
>>>> Is this in a Slurm job environment? Is it an interactive environment?
>>>>
>>>> The problem is that loading the TensorFlow Python module triggers an
>>>> MPI_Init, and Slurm doesn't like that for the reasons it mentions.
>>>>
>>>> We've been hitting this on our site too (but maybe only in my own
>>>> personal account, not system-wide); I haven't gotten to the bottom
>>>> of it yet...
>>>>
>>>> regards,
>>>>
>>>> Kenneth
>>>>
>>>>> On 22/01/2020 14:22, Loris Bennett wrote:
>>>>> Hi,
>>>>>
>>>>> I have built
>>>>>
>>>>> TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>>>
>>>>> However, when I import the Python module I get the following error:
>>>>>
>>>>> $ module add TensorFlow
>>>>> $ python
>>>>> Python 3.7.2 (default, Jun 6 2019, 09:12:17)
>>>>> [GCC 8.2.0] on linux
>>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>>>>> import tensorflow as tf
>>>>> [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
>>>>> --------------------------------------------------------------------------
>>>>> The application appears to have been direct launched using "srun",
>>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>>> execute. There are several options for building PMI support under
>>>>> SLURM, depending upon the SLURM version you are using:
>>>>>
>>>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>>>> requires that you configure and build SLURM --with-pmix.
>>>>>
>>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>>> to the SLURM PMI library location.
>>>>>
>>>>> Please configure as appropriate and try again.
>>>>> --------------------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init_thread
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> *** and potentially your MPI job)
>>>>> [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>>>
>>>>> With a bit of googling I found this:
>>>>>
>>>>> https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>>>
>>>>> Is this indeed an EB problem?
>>>>>
>>>>> Not having any understanding of TensorFlow, I don't know why just
>>>>> loading the Python module causes a Slurm job to be launched.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Loris
>>>>>
>>>>
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin       Email [email protected]

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin       Email [email protected]

