Dear Loris,

Isn’t it the opposite? Doesn’t Open MPI have to be built against Slurm properly?
We had several issues with Open MPI when it was compiled with dlopen and
similar options. I’d have to check the exact configure flags.
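
A quick way to check both sides is to ask each tool how it was built. A sketch, assuming ompi_info and srun are on $PATH (guarded so it degrades gracefully where they are not):

```shell
# Show the full Open MPI configure line (reveals any --with-pmi/--with-pmix)
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --all | grep -i "command line"
    # Run-time PMI/PMIx components this Open MPI can load
    ompi_info | grep -Ei "pmix|pmi"
else
    echo "ompi_info not on PATH"
fi

# List the MPI plugin types this Slurm build supports (none, pmi2, pmix, ...)
if command -v srun >/dev/null 2>&1; then
    srun --mpi=list
else
    echo "srun not on PATH"
fi
```

Comparing the two outputs shows whether the Open MPI in use and the Slurm launcher agree on a PMI flavour.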
Best 
Andreas 

> Am 29.01.2020 um 14:40 schrieb Loris Bennett <[email protected]>:
> 
> Hi,
> 
> Thinking about the problem, if it were just a question of rebuilding
> Slurm with a different version of OpenMPI, then presumably other
> MPI programs would also have issues with Slurm, but we haven't seen this.
> 
> So I am still mystified.
> 
> Cheers,
> 
> Loris
> 
> Loris Bennett <[email protected]> writes:
> 
>> Hi Kenneth,
>> 
>> I have tried two different things:
>> 
>> 1.
>> 
>> Starting an interactive job as user 'loris' via Slurm on a GPU node,
>> loading the TensorFlow module, starting Python and then importing the
>> python module 'tensorflow'.  This triggers the original error below.
>> 
>> 2.
>> 
>> Logging in directly to the same GPU node as above as 'root', loading
>> the TensorFlow module, starting Python and then importing the python
>> module 'tensorflow'.  This triggers the following warning:
>> 
>>  A process has executed an operation involving a call to the
>>  "fork()" system call to create a child process.  Open MPI is currently
>>  operating in a condition that could result in memory corruption or
>>  other system errors; your job may hang, crash, or produce silent
>>  data corruption.  The use of fork() (or system() or other calls that
>>  create child processes) is strongly discouraged.
>> 
>>  The process that invoked fork was:
>> 
>>    Local host:          [[17982,1],0] (PID 47441)
>> 
>>  If you are *absolutely sure* that your application will successfully
>>  and correctly survive a call to fork(), you may disable this warning
>>  by setting the mpi_warn_on_fork MCA parameter to 0.
>> 
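>> If the workload is known to be fork-safe, the warning can be silenced
>> as the message suggests; the mpi_warn_on_fork MCA parameter can also be
>> set as an environment variable before launch:
>> 
>>   $ export OMPI_MCA_mpi_warn_on_fork=0
>> 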
>> I can then, however, successfully start a TensorFlow session:
>> 
>>>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
>>  2020-01-28 15:32:57.120084: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with 
>> properties: 
>>  name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>  pciBusID: 0000:5e:00.0
>>  totalMemory: 10.92GiB freeMemory: 10.75GiB
>>  2020-01-28 15:32:57.225946: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with 
>> properties: 
>>  name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>>  pciBusID: 0000:d8:00.0
>>  totalMemory: 10.92GiB freeMemory: 10.76GiB
>>  2020-01-28 15:32:57.226648: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu 
>> devices: 0, 1
>>  2020-01-28 15:32:58.784466: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect 
>> StreamExecutor with strength 1 edge matrix:
>>  2020-01-28 15:32:58.784506: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
>>  2020-01-28 15:32:58.784513: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
>>  2020-01-28 15:32:58.784517: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
>>  2020-01-28 15:32:58.784649: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow 
>> device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) 
>> -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 
>> 0000:5e:00.0, compute capability: 6.1)
>>  2020-01-28 15:32:58.785041: I 
>> tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow 
>> device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) 
>> -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 
>> 0000:d8:00.0, compute capability: 6.1)
>>  2020-01-28 15:32:58.785979: I 
>> tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool 
>> with default inter op setting: 2. Tune using inter_op_parallelism_threads 
>> for best performance.
>>  Device mapping:
>>  /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 
>> GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>  /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce 
>> GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>>  2020-01-28 15:32:58.786118: I 
>> tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
>>  /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce 
>> GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>>  /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce 
>> GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>> 
>> So it looks as if the problem is a mismatch between the version of
>> OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
>> version loaded by our TensorFlow module (OpenMPI/3.1.3).  None of the
>> values for the '--mpi' option of 'srun' make any difference.
>> 
>> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
>> limited understanding of PMI leads me to believe that PMI itself can be
>> used instead of OpenMPI for starting MPI processes.
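>> 
>> For reference, the Open MPI error message quoted below describes two
>> build routes; they look roughly like this (flags as documented by
>> Open MPI and Slurm, paths hypothetical):
>> 
>>   # Open MPI built against Slurm's PMI-2 library:
>>   $ ./configure --with-slurm --with-pmi=/usr
>> 
>>   # or, with Slurm >= 16.05 built --with-pmix, launch via:
>>   $ srun --mpi=pmix ...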
>> 
>> Cheers,
>> 
>> Loris
>> 
>> Kenneth Hoste <[email protected]> writes:
>> 
>>> Hi Loris,
>>> 
>>> In which type of environment are you hitting this issue?
>>> 
>>> Is this in a Slurm job environment? Is it an interactive environment?
>>> 
>>> The problem is that loading the TensorFlow Python module triggers an 
>>> MPI_Init,
>>> and Slurm doesn't like that for the reasons it mentions.
>>> 
>>> We've been hitting this on our site too (though maybe only in my own
>>> personal account, not system-wide); I haven't gotten to the bottom of it yet...
>>> 
>>> 
>>> regards,
>>> 
>>> Kenneth
>>> 
>>>> On 22/01/2020 14:22, Loris Bennett wrote:
>>>> Hi,
>>>> 
>>>> I have built
>>>> 
>>>>   TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>> 
>>>> However, when I import the Python module I get the following error
>>>> 
>>>>   $ module add TensorFlow
>>>>   $ python
>>>>   Python 3.7.2 (default, Jun  6 2019, 09:12:17)
>>>>   [GCC 8.2.0] on linux
>>>>   Type "help", "copyright", "credits" or "license" for more information.
>>>>>>> import tensorflow as tf
>>>>   [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file 
>>>> pmix2x_client.c at line 109
>>>>   
>>>> --------------------------------------------------------------------------
>>>>   The application appears to have been direct launched using "srun",
>>>>   but OMPI was not built with SLURM's PMI support and therefore cannot
>>>>   execute. There are several options for building PMI support under
>>>>   SLURM, depending upon the SLURM version you are using:
>>>> 
>>>>     version 16.05 or later: you can use SLURM's PMIx support. This
>>>>     requires that you configure and build SLURM --with-pmix.
>>>> 
>>>>     Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>>     PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>>     install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>>     to the SLURM PMI library location.
>>>> 
>>>>   Please configure as appropriate and try again.
>>>>   
>>>> --------------------------------------------------------------------------
>>>>   *** An error occurred in MPI_Init_thread
>>>>   *** on a NULL communicator
>>>>   *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>   ***    and potentially your MPI job)
>>>>   [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT 
>>>> completed completed successfully, but am not able to aggregate error 
>>>> messages, and not able to guarantee that all other processes were killed!
>>>> 
>>>> With a bit of googling I found this:
>>>> 
>>>>   https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>> 
>>>> Is this indeed an EB problem?
>>>> 
>>>> Not having any understanding of TensorFlow, I don't know why just
>>>> loading the Python module causes a Slurm job to be launched.
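>>>> 
>>>> One thing worth checking is whether the compiled part of this
>>>> TensorFlow build links against Open MPI at all, which would explain
>>>> the MPI_Init on import; the library path below is typical for a
>>>> TF 1.x EasyBuild installation but is an assumption:
>>>> 
>>>>   $ ldd $EBROOTTENSORFLOW/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so | grep -i mpi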
>>>> 
>>>> Cheers,
>>>> 
>>>> Loris
>>>> 
>>> 
> -- 
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin         Email [email protected]
