Hi Loris,

In which type of environment are you hitting this issue?

Is this in a Slurm job environment? Is it an interactive environment?
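
If it helps to answer that, here is a minimal sketch for checking from Python whether the session is running inside a Slurm job. It relies only on the SLURM_JOB_ID variable that Slurm sets in job environments. The helper names slurm_vars and in_slurm_job are made up for this example, and I'm assuming (not claiming) that the SLURM_* variables are what matters here.

```python
import os


def slurm_vars(env=None):
    """Return the SLURM_* variables found in the given environment mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return {name: value for name, value in env.items() if name.startswith("SLURM_")}


def in_slurm_job(env=None):
    """True if SLURM_JOB_ID is set, i.e. we appear to be inside a Slurm job."""
    return "SLURM_JOB_ID" in slurm_vars(env)


if __name__ == "__main__":
    found = slurm_vars()
    print("inside a Slurm job:", in_slurm_job())
    # List whatever Slurm exported, so it can be pasted into a reply.
    for name in sorted(found):
        print(f"{name}={found[name]}")
```

Running this once on a login node and once inside the job (or interactive session) where the import fails should show whether the two environments differ.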

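The problem is that importing the TensorFlow Python module triggers an MPI_Init call (MPI_Init_thread in your output), and Open MPI aborts because it believes it was launched directly via srun without PMI support, as the error message explains.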

We've been hitting this at our site too (though possibly only in my own account rather than system-wide), but I haven't gotten to the bottom of it yet.


regards,

Kenneth

On 22/01/2020 14:22, Loris Bennett wrote:
Hi,

I have built

   TensorFlow/1.13.1-fosscuda-2019a-Python-3.7.2

However, when I import the Python module I get the following error

   $ module add TensorFlow
   $ python
   Python 3.7.2 (default, Jun  6 2019, 09:12:17)
   [GCC 8.2.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import tensorflow as tf
   [g001.curta.zedat.fu-berlin.de:227147] OPAL ERROR: Error in file pmix2x_client.c at line 109
   --------------------------------------------------------------------------
   The application appears to have been direct launched using "srun",
   but OMPI was not built with SLURM's PMI support and therefore cannot
   execute. There are several options for building PMI support under
   SLURM, depending upon the SLURM version you are using:

     version 16.05 or later: you can use SLURM's PMIx support. This
     requires that you configure and build SLURM --with-pmix.

     Versions earlier than 16.05: you must use either SLURM's PMI-1 or
     PMI-2 support. SLURM builds PMI-1 by default, or you can manually
     install PMI-2. You must then build Open MPI using --with-pmi pointing
     to the SLURM PMI library location.

   Please configure as appropriate and try again.
   --------------------------------------------------------------------------
   *** An error occurred in MPI_Init_thread
   *** on a NULL communicator
   *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
   ***    and potentially your MPI job)
   [g001.curta.zedat.fu-berlin.de:227147] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

With a bit of googling I found this:

   https://gist.github.com/boegel/c3605eb614916af4a6243ae91fd29b33

Is this indeed an EB problem?

Not having any understanding of TensorFlow, I don't know why just
loading the Python module causes a Slurm job to be launched.

Cheers,

Loris
