Hi Kenneth,

I have tried two different things:

1.
   
Starting an interactive job as user 'loris' via Slurm on a GPU node,
loading the TensorFlow module, starting Python and then importing the
Python module 'tensorflow'.  This triggers the original error below.

2.

Logging in directly to the same GPU node as above as 'root', loading
the TensorFlow module, starting Python and then importing the Python
module 'tensorflow'.  This triggers the following warning:

  A process has executed an operation involving a call to the
  "fork()" system call to create a child process.  Open MPI is currently
  operating in a condition that could result in memory corruption or
  other system errors; your job may hang, crash, or produce silent
  data corruption.  The use of fork() (or system() or other calls that
  create child processes) is strongly discouraged.

  The process that invoked fork was:

    Local host:          [[17982,1],0] (PID 47441)

  If you are *absolutely sure* that your application will successfully
  and correctly survive a call to fork(), you may disable this warning
  by setting the mpi_warn_on_fork MCA parameter to 0.
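
As an aside, the MCA parameter mentioned in the warning can be set via
the standard OMPI_MCA_<param> environment-variable convention (I have
not tested whether silencing the warning is actually advisable here):

```shell
# Silence the Open MPI fork() warning via the MCA environment variable
# (standard OMPI_MCA_<param> convention; this only hides the warning,
# it does not make fork() under MPI any safer)
export OMPI_MCA_mpi_warn_on_fork=0
```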

I can then, however, successfully start a TensorFlow session:

  >>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
  2020-01-28 15:32:57.120084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
  name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
  pciBusID: 0000:5e:00.0
  totalMemory: 10.92GiB freeMemory: 10.75GiB
  2020-01-28 15:32:57.225946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
  name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
  pciBusID: 0000:d8:00.0
  totalMemory: 10.92GiB freeMemory: 10.76GiB
  2020-01-28 15:32:57.226648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
  2020-01-28 15:32:58.784466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
  2020-01-28 15:32:58.784506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1
  2020-01-28 15:32:58.784513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y
  2020-01-28 15:32:58.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N
  2020-01-28 15:32:58.784649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1)
  2020-01-28 15:32:58.785041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1)
  2020-01-28 15:32:58.785979: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
  Device mapping:
  /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
  /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
  2020-01-28 15:32:58.786118: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
  /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
  /job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1

So it looks as if the problem is a mismatch between the version of
Open MPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
version loaded by our TensorFlow module (OpenMPI/3.1.3).  None of the
values for the '--mpi' option of 'srun' make any difference.
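
For reference, the variants I tried looked roughly like this (the
exact partition/gres flags are from memory, so treat them as a
sketch):

```shell
# Show which PMI plugins this Slurm installation offers
srun --mpi=list

# Then try each one explicitly when launching the interactive job
srun --mpi=pmi2 --gres=gpu:1 --pty bash -l
srun --mpi=pmix --gres=gpu:1 --pty bash -l
```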

Perhaps Slurm needs to be rebuilt with OpenMPI 3.1.3, but my very
limited understanding of PMI leads me to believe that PMI can be used
instead of Open MPI for starting MPI processes.
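
Following the hints in the original error message, a rebuild of the
toolchain's Open MPI against Slurm's PMI might look something like
this (the prefix and PMI path below are guesses for our system, not
verified values):

```shell
# Hypothetical reconfigure of Open MPI 3.1.3 with Slurm PMI support
# (paths are guesses; --with-pmi must point at wherever Slurm installs
# its pmi.h/libpmi, per the hint in the original error message)
./configure --prefix=$HOME/ompi-3.1.3-slurm \
            --with-slurm \
            --with-pmi=/usr
make -j 8 && make install
```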

Cheers,

Loris
   
Kenneth Hoste <[email protected]> writes:

> Hi Loris,
>
> In which type of environment are you hitting this issue?
>
> Is this in a Slurm job environment? Is it an interactive environment?
>
> The problem is that loading the TensorFlow Python module triggers an MPI_Init,
> and Slurm doesn't like that for the reasons it mentions.
>
> We've been hitting this on our site too (but maybe only in my own personal
> account, not system-wide), I haven't gotten to the bottom of it yet...
>
>
> regards,
>
> Kenneth
>
> On 22/01/2020 14:22, Loris Bennett wrote:
>> Hi,
>>
>> I have built
>>
>>    TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>
>> However, when I import the Python module I get the following error
>>
>>    $ module add TensorFlow
>>    $ python
>>    Python 3.7.2 (default, Jun  6 2019, 09:12:17)
>>    [GCC 8.2.0] on linux
>>    Type "help", "copyright", "credits" or "license" for more information.
>>    >>> import tensorflow as tf
>>    [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
>>    --------------------------------------------------------------------------
>>    The application appears to have been direct launched using "srun",
>>    but OMPI was not built with SLURM's PMI support and therefore cannot
>>    execute. There are several options for building PMI support under
>>    SLURM, depending upon the SLURM version you are using:
>>
>>      version 16.05 or later: you can use SLURM's PMIx support. This
>>      requires that you configure and build SLURM --with-pmix.
>>
>>      Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>      PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>      install PMI-2. You must then build Open MPI using --with-pmi pointing
>>      to the SLURM PMI library location.
>>
>>    Please configure as appropriate and try again.
>>    --------------------------------------------------------------------------
>>    *** An error occurred in MPI_Init_thread
>>    *** on a NULL communicator
>>    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>    ***    and potentially your MPI job)
>>    [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>
>> With a bit of googling I found this:
>>
>>    https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>
>> Is this indeed an EB problem?
>>
>> Not having any understanding of TensorFlow, I don't know why just
>> loading the Python module causes a Slurm job to be launched.
>>
>> Cheers,
>>
>> Loris
>>
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
