Hi,

Thinking about the problem: if it were just a question of rebuilding
Slurm with a different version of OpenMPI, then presumably other MPI
programs would also have issues with Slurm, but we haven't seen this.
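(For the record, the kind of sanity check meant here looks roughly like the
sketch below — 'hello_mpi' is a placeholder name for any small MPI test
binary, and the plugin names available depend on how your Slurm was built.)

```shell
# Sketch: confirm that an ordinary MPI program launches fine under srun,
# which would suggest Slurm's PMI side works in general.

# List the PMI plugin types this Slurm installation actually provides:
srun --mpi=list

# Try launching a small test binary ('hello_mpi' is a placeholder)
# with each relevant plugin:
srun --mpi=pmi2 -n 2 ./hello_mpi
srun --mpi=pmix -n 2 ./hello_mpi
```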
So I am still mystified.

Cheers,

Loris

Loris Bennett <[email protected]> writes:

> Hi Kenneth,
>
> I have tried two different things:
>
> 1.
>
> Starting an interactive job as user 'loris' via Slurm on a GPU node,
> loading the TensorFlow module, starting Python and then importing the
> Python module 'tensorflow'. This triggers the original error below.
>
> 2.
>
> Logging in directly to the same GPU node as above as 'root', loading
> the TensorFlow module, starting Python and then importing the Python
> module 'tensorflow'. This triggers the following warning:
>
> A process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
>   Local host: [[17982,1],0] (PID 47441)
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
>
> I can then, however, successfully start a TensorFlow session:
>
> >>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
> 2020-01-28 15:32:57.120084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
> pciBusID: 0000:5e:00.0
> totalMemory: 10.92GiB freeMemory: 10.75GiB
> 2020-01-28 15:32:57.225946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
> pciBusID: 0000:d8:00.0
> totalMemory: 10.92GiB freeMemory: 10.76GiB
> 2020-01-28 15:32:57.226648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
> 2020-01-28 15:32:58.784466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
> 2020-01-28 15:32:58.784506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
> 2020-01-28 15:32:58.784513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
> 2020-01-28 15:32:58.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
> 2020-01-28 15:32:58.784649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1)
> 2020-01-28 15:32:58.785041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1)
> 2020-01-28 15:32:58.785979: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
> Device mapping:
> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
> 2020-01-28 15:32:58.786118: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>
> So it looks as if the problem is a mismatch between the version of
> OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
> version loaded by our TensorFlow module (OpenMPI/3.1.3). None of the
> values for the '--mpi' option for 'srun' makes any difference.
>
> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
> limited understanding of PMI leads me to believe that this can be used
> instead of OpenMPI for starting MPI processes.
>
> Cheers,
>
> Loris
>
> Kenneth Hoste <[email protected]> writes:
>
>> Hi Loris,
>>
>> In which type of environment are you hitting this issue?
>>
>> Is this in a Slurm job environment? Is it an interactive environment?
>>
>> The problem is that loading the TensorFlow Python module triggers an
>> MPI_Init, and Slurm doesn't like that for the reasons it mentions.
>>
>> We've been hitting this on our site too (but maybe only in my own
>> personal account, not system-wide); I haven't gotten to the bottom of
>> it yet...
>>
>> regards,
>>
>> Kenneth
>>
>> On 22/01/2020 14:22, Loris Bennett wrote:
>>> Hi,
>>>
>>> I have built
>>>
>>> TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>
>>> However, when I import the Python module I get the following error:
>>>
>>> $ module add TensorFlow
>>> $ python
>>> Python 3.7.2 (default, Jun 6 2019, 09:12:17)
>>> [GCC 8.2.0] on linux
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import tensorflow as tf
>>> [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
>>> --------------------------------------------------------------------------
>>> The application appears to have been direct launched using "srun",
>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>> execute. There are several options for building PMI support under
>>> SLURM, depending upon the SLURM version you are using:
>>>
>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>> requires that you configure and build SLURM --with-pmix.
>>>
>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>> to the SLURM PMI library location.
>>>
>>> Please configure as appropriate and try again.
>>> --------------------------------------------------------------------------
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> *** and potentially your MPI job)
>>> [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>
>>> With a bit of googling I found this:
>>>
>>> https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>
>>> Is this indeed an EB problem?
>>>
>>> Not having any understanding of TensorFlow, I don't know why just
>>> loading the Python module causes a Slurm job to be launched.
>>>
>>> Cheers,
>>>
>>> Loris
>>>

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
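P.S. For anyone hitting the same error: going by the message quoted above,
the check-and-fix path would look roughly like the sketch below. This is
only an illustration of what the OMPI error text suggests — the prefix and
library paths are made-up examples and must be adjusted to your site.

```shell
# Check whether the OpenMPI that the TensorFlow module loads was built
# with any knowledge of Slurm's PMI/PMIx (run after 'module add TensorFlow'):
ompi_info | grep -i -e slurm -e pmi

# If it wasn't, the error text says OpenMPI itself needs rebuilding against
# Slurm's PMI libraries, e.g. (paths and prefix are hypothetical examples):
./configure --with-slurm --with-pmi=/usr --prefix=/opt/openmpi/3.1.3
make -j 8 && make install
```

(With Slurm >= 16.05 the alternative route the message mentions is building
Slurm itself --with-pmix and then using 'srun --mpi=pmix'.)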

