Hi Kenneth,
I have tried two different things:
1.
Starting an interactive job as user 'loris' via Slurm on a GPU node,
loading the TensorFlow module, starting Python and then importing the
Python module 'tensorflow'. This triggers the original error below.
2.
Logging in directly to the same GPU node as above as 'root', loading
the TensorFlow module, starting Python and then importing the Python
module 'tensorflow'. This triggers the following warning:
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[17982,1],0] (PID 47441)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
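For reference, the MCA parameter the warning mentions can be set via an
environment variable, so no command-line change is needed. This is only a
sketch of how to silence the warning; it should of course only be used if
the application really does survive fork():

```shell
# Silence the Open MPI fork() warning by setting the MCA parameter to 0.
# Open MPI reads MCA parameters from OMPI_MCA_<param> environment variables.
export OMPI_MCA_mpi_warn_on_fork=0

# Alternatively, pass it per launch:
#   mpirun --mca mpi_warn_on_fork 0 python my_script.py
```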
I can then, however, successfully start a TensorFlow session:
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2020-01-28 15:32:57.120084: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with
properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:5e:00.0
totalMemory: 10.92GiB freeMemory: 10.75GiB
2020-01-28 15:32:57.225946: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with
properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:d8:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2020-01-28 15:32:57.226648: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu
devices: 0, 1
2020-01-28 15:32:58.784466: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect
StreamExecutor with strength 1 edge matrix:
2020-01-28 15:32:58.784506: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2020-01-28 15:32:58.784513: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2020-01-28 15:32:58.784517: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2020-01-28 15:32:58.784649: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow
device (/job:localhost/replica:0/task:0/device:GPU:0 with 10386 MB memory) ->
physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:5e:00.0,
compute capability: 6.1)
2020-01-28 15:32:58.785041: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow
device (/job:localhost/replica:0/task:0/device:GPU:1 with 10398 MB memory) ->
physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:d8:00.0,
compute capability: 6.1)
2020-01-28 15:32:58.785979: I
tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool
with default inter op setting: 2. Tune using inter_op_parallelism_threads for
best performance.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX
1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX
1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
2020-01-28 15:32:58.786118: I
tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX
1080 Ti, pci bus id: 0000:5e:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX
1080 Ti, pci bus id: 0000:d8:00.0, compute capability: 6.1
So it looks as if the problem is a mismatch between the version of
OpenMPI used to build Slurm (version 1.10.7 from CentOS 7.7) and the
version loaded by our TensorFlow module (OpenMPI/3.1.3). None of the
values for the '--mpi' option of 'srun' makes any difference.
Perhaps Slurm needs to be rebuilt against OpenMPI 3.1.3, but my very
limited understanding of PMI leads me to believe that PMI can be used
instead of OpenMPI itself for starting MPI processes.
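To narrow down the mismatch, it may help to compare what Slurm and Open MPI
each support. A small diagnostic sketch (both commands are read-only; the
guards just let it run cleanly on nodes where one of the tools is missing):

```shell
# Which PMI plugins this Slurm installation offers (e.g. pmi2, pmix):
if command -v srun >/dev/null 2>&1; then
    srun --mpi=list
fi

# Which PMI flavour the currently loaded Open MPI was built against:
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i pmi
fi
```

If 'srun --mpi=list' shows no pmix entry while the loaded Open MPI expects
PMIx, that would be consistent with the OPAL error in the original report.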
Cheers,
Loris
Kenneth Hoste <[email protected]> writes:
> Hi Loris,
>
> In which type of environment are you hitting this issue?
>
> Is this in a Slurm job environment? Is it an interactive environment?
>
> The problem is that loading the TensorFlow Python module triggers an MPI_Init,
> and Slurm doesn't like that for the reasons it mentions.
>
> We've been hitting this on our site too (but maybe only in my own personal
> account, not system-wide), I haven't gotten to the bottom of it yet...
>
>
> regards,
>
> Kenneth
>
> On 22/01/2020 14:22, Loris Bennett wrote:
>> Hi,
>>
>> I have built
>>
>> TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2
>>
>> However, when I import the Python module I get the following error
>>
>> $ module add TensorFlow
>> $ python
>> Python 3.7.2 (default, Jun 6 2019, 09:12:17)
>> [GCC 8.2.0] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import tensorflow as tf
>> [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file
>> pmix2x_client.c at line 109
>> --------------------------------------------------------------------------
>> The application appears to have been direct launched using "srun",
>> but OMPI was not built with SLURM's PMI support and therefore cannot
>> execute. There are several options for building PMI support under
>> SLURM, depending upon the SLURM version you are using:
>>
>> version 16.05 or later: you can use SLURM's PMIx support. This
>> requires that you configure and build SLURM --with-pmix.
>>
>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>> to the SLURM PMI library location.
>>
>> Please configure as appropriate and try again.
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> *** and potentially your MPI job)
>> [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT
>> completed completed successfully, but am not able to aggregate error
>> messages, and not able to guarantee that all other processes were killed!
>>
>> With a bit of googling I found this:
>>
>> https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
>>
>> Is this indeed an EB problem?
>>
>> Not having any understanding of TensorFlow, I don't know why just
>> loading the Python module causes a Slurm job to be launched.
>>
>> Cheers,
>>
>> Loris
>>
>
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email [email protected]