Hi,
I have built TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2. However, when I
import the Python module, I get the following error:
$ module add TensorFlow
$ python
Python 3.7.2 (default, Jun 6 2019, 09:12:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
[g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed
completed successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!
With a bit of googling I found this:
https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33
Is this indeed an EB problem?
Not having any understanding of TensorFlow, I don't know why just
loading the Python module causes a Slurm job to be launched.
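One thing that can at least be checked is whether the shell in which the
import fails is inside a Slurm allocation, since the message talks about
"srun". The following is only a sketch for that check: the list of variables
and the helper name slurm_environment are my own choices for illustration,
and I am assuming, not asserting, that Open MPI looks at variables like these
when it decides it was direct launched.

```python
import os

# Variables that Slurm sets inside a job or job step. Which of these
# Open MPI actually inspects is an assumption, not something verified here.
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_JOBID", "SLURM_STEP_ID", "SLURM_PROCID")


def slurm_environment(environ=os.environ):
    """Return the subset of SLURM_VARS that is set in the given environment."""
    return {name: environ[name] for name in SLURM_VARS if name in environ}


if __name__ == "__main__":
    found = slurm_environment()
    if found:
        # Print whatever Slurm job variables are present in this session.
        for name, value in sorted(found.items()):
            print(f"{name}={value}")
    else:
        print("No Slurm job variables set")
```

Running this in the same session, before "import tensorflow", would show
whether the interpreter is started from within a job at all.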
Cheers,
Loris
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin