Hi,

Thinking about the problem: if it were just a question of rebuilding
Slurm with a different version of OpenMPI, then presumably other
MPI programs would also have issues with Slurm, but we haven't seen this.

So I am still mystified.

Cheers,

Loris

Loris Bennett <[email protected]> writes:

> Hi Kenneth,
>
> I have tried two different things:
>
> 1.
>    
> Starting an interactive job as user 'loris' via Slurm on a GPU node,
> loading the TensorFlow module, starting Python and then importing the
> Python module 'tensorflow'.  This triggers the original error below.
>
> 2.
>
> Logging in directly to the same GPU node as above as 'root', loading
> the TensorFlow module, starting Python and then importing the Python
> module 'tensorflow'.  This triggers the following warning:
>
>   A process has executed an operation involving a call to the
>   "fork()" system call to create a child process.  Open MPI is currently
>   operating in a condition that could result in memory corruption or
>   other system errors; your job may hang, crash, or produce silent
>   data corruption.  The use of fork() (or system() or other calls that
>   create child processes) is strongly discouraged.
>
>   The process that invoked fork was:
>
>     Local host:          [[17982,1],0] (PID 47441)
>
>   If you are *absolutely sure* that your application will successfully
>   and correctly survive a call to fork(), you may disable this warning
>   by setting the mpi_warn_on_fork MCA parameter to 0.
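As the warning itself suggests, the check can be silenced via the mpi_warn_on_fork MCA parameter. A minimal sketch, assuming Open MPI's usual environment-variable convention for MCA parameters:

```shell
# Open MPI reads MCA parameters from OMPI_MCA_-prefixed environment
# variables; setting this before starting Python suppresses the fork()
# warning (it does not make fork() any safer, it only hides the message).
export OMPI_MCA_mpi_warn_on_fork=0
```

The same could be set per-invocation with `mpirun --mca mpi_warn_on_fork 0`, but for a Python session started by hand the environment variable is simpler.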
>
> I can then, however, successfully start a TensorFlow session:
>
>   >>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
>   2020-01-28 15:32:57.120084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
>   name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>   pciBusID: 0000:5e:00.0
>   totalMemory: 10.92GiB freeMemory: 10.75GiB
>   2020-01-28 15:32:57.225946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
>   name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
>   pciBusID: 0000:d8:00.0
>   totalMemory: 10.92GiB freeMemory: 10.76GiB
>   2020-01-28 15:32:57.226648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
>   2020-01-28 15:32:58.784466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
>   2020-01-28 15:32:58.784506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
>   2020-01-28 15:32:58.784513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
>   2020-01-28 15:32:58.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
>   2020-01-28 15:32:58.784649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1)
>   2020-01-28 15:32:58.785041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1)
>   2020-01-28 15:32:58.785979: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
>   Device mapping:
>   /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>   /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>   2020-01-28 15:32:58.786118: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
>   /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
>   /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
>
> So it looks as if the problem is a mismatch between the version of
> OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
> version loaded by our TensorFlow module (OpenMPI/3.1.3).  None of the
> values for the '--mpi' option of 'srun' make any difference.
>
> Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
> limited understanding of PMI leads me to believe that PMI can be used
> instead of OpenMPI itself for starting MPI processes.
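One way to see why MPI_Init behaves differently under srun than under a direct root login is to compare the PMI-related environment variables Slurm injects; a small hypothetical diagnostic:

```shell
# Hypothetical diagnostic: list the PMI/Slurm variables visible to the
# current process.  Under `srun`, Slurm injects PMI_*/PMIX_*/SLURM_*
# variables that make Open MPI attempt a PMI handshake during MPI_Init;
# in a plain SSH login they are absent, which would explain the
# different behaviour of the two tests above.
pmi_env() {
    env | grep -E '^(PMI_|PMIX_|SLURM_)' | sort
}
pmi_env || echo "no PMI/Slurm variables set"
```

Running `pmi_env` inside the interactive Slurm job and again in the root login shell should show exactly which variables differ between the two cases.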
>
> Cheers,
>
> Loris
>    
> Kenneth Hoste <[email protected]> writes:
>
>> Hi Loris,
>>
>> In which type of environment are you hitting this issue?
>>
>> Is this in a Slurm job environment? Is it an interactive environment?
>>
>> The problem is that loading the TensorFlow Python module triggers an
>> MPI_Init, and Open MPI doesn't like that, for the reasons given in the
>> error message.
>>
>> We've been hitting this on our site too (but maybe only in my own personal
>> account, not system-wide), I haven't gotten to the bottom of it yet...
>>
>>
>> regards,
>>
>> Kenneth
>>
>> On 22/01/2020 14:22, Loris Bennett wrote:
>>> Hi,
>>>
>>> I have built
>>>
>>>    TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>>
>>> However, when I import the Python module I get the following error
>>>
>>>    $ module add TensorFlow
>>>    $ python
>>>    Python 3.7.2 (default, Jun  6 2019, 09:12:17)
>>>    [GCC 8.2.0] on linux
>>>    Type "help", "copyright", "credits" or "license" for more information.
>>>    >>> import tensorflow as tf
>>>    [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
>>>    --------------------------------------------------------------------------
>>>    The application appears to have been direct launched using "srun",
>>>    but OMPI was not built with SLURM's PMI support and therefore cannot
>>>    execute. There are several options for building PMI support under
>>>    SLURM, depending upon the SLURM version you are using:
>>>
>>>      version 16.05 or later: you can use SLURM's PMIx support. This
>>>      requires that you configure and build SLURM --with-pmix.
>>>
>>>      Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>      PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>      install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>      to the SLURM PMI library location.
>>>
>>>    Please configure as appropriate and try again.
>>>    --------------------------------------------------------------------------
>>>    *** An error occurred in MPI_Init_thread
>>>    *** on a NULL communicator
>>>    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>    ***    and potentially your MPI job)
>>>    [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
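The advice in the error message boils down to rebuilding either Slurm (with `--with-pmix`) or Open MPI (with `--with-pmi`). A hypothetical configure invocation for the latter on a CentOS 7 system; the prefix and the PMI path are examples and depend on where the slurm-devel package installed its PMI headers and libraries:

```shell
# Hypothetical: rebuild Open MPI against Slurm's PMI library so that
# direct launch via `srun` works.  Paths below are examples only.
./configure --prefix=$HOME/opt/openmpi-3.1.3-slurm \
            --with-slurm \
            --with-pmi=/usr
make -j 8 && make install
```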
>>>
>>> With a bit of googling I found this:
>>>
>>>    https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>>
>>> Is this indeed an EB problem?
>>>
>>> Not having any understanding of TensorFlow, I don't know why just
>>> loading the Python module causes a Slurm job to be launched.
>>>
>>> Cheers,
>>>
>>> Loris
>>>
>>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
