Dear Loris,

Isn't it the opposite? Open MPI has to be built against Slurm properly? We had
several issues with Open MPI when it was compiled with dlopen support and
similar. I'd have to check the exact configuration flags.

Best
Andreas
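One way to check how an existing Open MPI installation was configured is `ompi_info`, which prints the original configure line and the components that were built in. A sketch, assuming the Open MPI from the loaded module is first on the PATH:

```shell
# Show the configure command line this Open MPI build was made with
ompi_info | grep -i "configure command"

# Check which PMI/PMIx-related components this Open MPI knows about
ompi_info | grep -i pmi
```

If the configure line lacks `--with-pmi`/`--with-pmix`, that would point at the build rather than at Slurm.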
> On 29.01.2020 at 14:40, Loris Bennett <[email protected]> wrote:
>
> Hi,
>
> Thinking about the problem, if it were a question of just rebuilding
> Slurm with a different version of Open MPI, then presumably other
> MPI programs would have issues with Slurm, but we haven't seen this.
>
> So I am still mystified.
>
> Cheers,
>
> Loris
>
> Loris Bennett <[email protected]> writes:
>
>> Hi Kenneth,
>>
>> I have tried two different things:
>>
>> 1. Starting an interactive job as user 'loris' via Slurm on a GPU node,
>> loading the TensorFlow module, starting Python and then importing the
>> Python module 'tensorflow'. This triggers the original error below.
>>
>> 2. Logging directly into the same GPU node as above as 'root', loading
>> the TensorFlow module, starting Python and then importing the Python
>> module 'tensorflow'. This triggers the following warning:
>>
>> A process has executed an operation involving a call to the
>> "fork()" system call to create a child process. Open MPI is currently
>> operating in a condition that could result in memory corruption or
>> other system errors; your job may hang, crash, or produce silent
>> data corruption. The use of fork() (or system() or other calls that
>> create child processes) is strongly discouraged.
>>
>> The process that invoked fork was:
>>
>> Local host: [[17982,1],0] (PID 47441)
>>
>> If you are *absolutely sure* that your application will successfully
>> and correctly survive a call to fork(), you may disable this warning
>> by setting the mpi_warn_on_fork MCA parameter to 0.
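As the warning itself suggests, it can be silenced via the mpi_warn_on_fork MCA parameter. For jobs launched via srun, MCA parameters can be passed as environment variables; whether silencing is actually safe depends on what TensorFlow forks for:

```shell
# Silence Open MPI's fork() warning for processes launched from this shell;
# any MCA parameter <name> can be set as OMPI_MCA_<name> in the environment.
export OMPI_MCA_mpi_warn_on_fork=0

# Equivalent when launching with mpirun instead of srun:
# mpirun --mca mpi_warn_on_fork 0 python my_script.py
```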
>>
>> I can then, however, successfully start a TensorFlow session:
>>
>>>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
>> 2020-01-28 15:32:57.120084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>> pciBusID: 0000:5e:00.0
>> totalMemory: 10.92GiB freeMemory: 10.75GiB
>> 2020-01-28 15:32:57.225946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
>> name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>> pciBusID: 0000:d8:00.0
>> totalMemory: 10.92GiB freeMemory: 10.76GiB
>> 2020-01-28 15:32:57.226648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
>> 2020-01-28 15:32:58.784466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
>> 2020-01-28 15:32:58.784506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
>> 2020-01-28 15:32:58.784513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
>> 2020-01-28 15:32:58.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
>> 2020-01-28 15:32:58.784649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1)
>> 2020-01-28 15:32:58.785041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1)
>> 2020-01-28 15:32:58.785979: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
>> Device mapping:
>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>> 2020-01-28 15:32:58.786118: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
>> /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>> /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>
>> So it looks as if the problem is a mismatch between the version of
>> Open MPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
>> version loaded by our TensorFlow module (OpenMPI/3.1.3). None of the
>> values for the '--mpi' option of 'srun' makes any difference.
>>
>> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
>> limited understanding of PMI led me to believe that PMI can be used
>> instead of Open MPI for starting MPI processes.
>>
>> Cheers,
>>
>> Loris
>>
>> Kenneth Hoste <[email protected]> writes:
>>
>>> Hi Loris,
>>>
>>> In which type of environment are you hitting this issue?
>>>
>>> Is this in a Slurm job environment? Is it an interactive environment?
>>>
>>> The problem is that loading the TensorFlow Python module triggers an
>>> MPI_Init, and Slurm doesn't like that for the reasons it mentions.
>>>
>>> We've been hitting this on our site too (but maybe only in my own
>>> personal account, not system-wide), I haven't gotten to the bottom of
>>> it yet...
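Whether any of the '--mpi' values can help depends on what both sides were built with: srun can only offer the PMI plugins Slurm has, and the Open MPI inside the module has to support a matching one. A quick sanity check, as a sketch assuming the module name OpenMPI/3.1.3 from above:

```shell
# List the PMI plugins this Slurm installation offers (e.g. pmi2, pmix)
srun --mpi=list

# Load the Open MPI the TensorFlow module uses and check which
# PMI flavours it was built with
module load OpenMPI/3.1.3
ompi_info | grep -i -e pmix -e pmi2
```

If the two lists share no entry, no '--mpi' value will make direct launch work.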
>>>
>>> regards,
>>>
>>> Kenneth
>>>
>>>> On 22/01/2020 14:22, Loris Bennett wrote:
>>>> Hi,
>>>>
>>>> I have built
>>>>
>>>> TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>>
>>>> However, when I import the Python module I get the following error:
>>>>
>>>> $ module add TensorFlow
>>>> $ python
>>>> Python 3.7.2 (default, Jun 6 2019, 09:12:17)
>>>> [GCC 8.2.0] on linux
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>>>> import tensorflow as tf
>>>> [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
>>>> --------------------------------------------------------------------------
>>>> The application appears to have been direct launched using "srun",
>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>> execute. There are several options for building PMI support under
>>>> SLURM, depending upon the SLURM version you are using:
>>>>
>>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>>> requires that you configure and build SLURM --with-pmix.
>>>>
>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>> to the SLURM PMI library location.
>>>>
>>>> Please configure as appropriate and try again.
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init_thread
>>>> *** on a NULL communicator
>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> *** and potentially your MPI job)
>>>> [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
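For reference, the two build routes the error message describes would look roughly like this; the install paths are placeholders, not the actual locations on this system:

```shell
# Option 1 (Slurm >= 16.05): build Slurm itself with PMIx support
./configure --with-pmix=/path/to/pmix && make && make install

# Option 2: build Open MPI against Slurm's PMI/PMI-2 libraries
./configure --with-slurm --with-pmi=/path/to/slurm && make && make install
```

Note that option 2 changes the Open MPI build (here, the one in the fosscuda toolchain), not Slurm, which matches Andreas's point at the top of the thread.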
>>>>
>>>> With a bit of googling I found this:
>>>>
>>>> https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>>
>>>> Is this indeed an EB problem?
>>>>
>>>> Not having any understanding of TensorFlow, I don't know why just
>>>> loading the Python module causes a Slurm job to be launched.
>>>>
>>>> Cheers,
>>>>
>>>> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin
> Email: [email protected]

