Åke Sandgren <[email protected]> writes:

> On 12/5/19 4:46 PM, Loris Bennett wrote:
>> Åke Sandgren <[email protected]> writes:
>>
>>> On 12/5/19 11:40 AM, Loris Bennett wrote:
>>>> I have tried this with
>>>>
>>>> #!/bin/bash
>>>>
>>>> #SBATCH --job-name=easybuild_gpu
>>>> #SBATCH --ntasks=4
>>>> #SBATCH --time=12:00:00
>>>> #SBATCH --mem-per-cpu=1G
>>>> #SBATCH --partition=gpu
>>>> #SBATCH --qos=medium
>>>>
>>>> srun eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
>>>
>>> Drop the srun part. You don't want to start 4 eb's doing the same thing.
>>> That may be the reason for your error.
>>
>> Doesn't Easybuild use the number of cores available for parallel make?
>
> Yes, but srun starts --ntasks instances of its argument, in this case
> eb, i.e. something you should only use for MPI programs.
>
> eb in itself is just a piece of serial Python code.

Got it.

>>>> but get the error
>>>>
>>>> == FAILED: Installation ended unsuccessfully (build directory:
>>>> /trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2):
>>>> build failed (first 300 chars): Failed to chmod/chown several paths:
>>>> ['/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2',
>>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/protobufpython',
>>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/abslpy
>>>> (took 4 sec)
>>>>
>>>> I'm running the Slurm job as the same user I use always to run
>>>> Easybuild, so all the above directories are already owned by that user.
>>>>
>>>> Any ideas about what I might be doing wrong?
>>>
>>> You need to look in the log file to see what the actual error is, the
>>> summary just tells you something went wrong.
>>
>> The actual error is
>>
>> last error: [Errno 30] Read-only file system:
>>
>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/TensorFlow/tensorflow-1.13.1/tools/python_bin_path.sh'
>>
>> Indeed the NFS directory containing Easybuild and all the software was
>> mounted read-only on the compute nodes.
>>
>> So I remounted read-write, but I still get the same error :-/
>
> Did you make sure it got remounted RW on ALL nodes?
> Since if you get the same error of RO file system...

I just remounted on one node and used

#SBATCH --nodelist=g011

I then tried running eb directly on the node without Slurm - same story.

Maybe I'm just doing something stupid because it's late. I'll try again
tomorrow.

Thanks for the help,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email [email protected]
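
For the read-only question above, a quick check on the node itself might help separate "the remount did not take effect" from "something else is refusing the write". The following is only a rough sketch: the function names mount_mode and can_write are made up for illustration, and /trinity/shared in the usage comment is a guess at the mount point, which the thread does not actually state.

```shell
#!/bin/bash
# Sketch: report whether a mount point is mounted read-only, and whether
# the current user can actually create a file in a given directory.

# mount_mode MOUNTS_FILE MOUNT_POINT
# Prints "ro" or "rw" for MOUNT_POINT, based on the options field (4th
# column) of a /proc/mounts-style file. If the mount point is listed more
# than once, the last entry wins. Prints "unknown" if it is not listed.
mount_mode() {
    awk -v mp="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        }
        END { print (mode == "" ? "unknown" : mode) }
    ' "$1"
}

# can_write DIR
# Tries to create and then remove a temporary file in DIR.
# Returns 0 only if the file could be created.
can_write() {
    local probe
    probe=$(mktemp "$1/.rw_probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Example usage (paths are assumptions, adjust to the real mount point):
#   mount_mode /proc/mounts /trinity/shared
#   can_write /trinity/shared/easybuild/build && echo "writable"
```

If mount_mode still reports "ro" on g011 after the remount, the remount is the thing to look at; if it reports "rw" but can_write fails, the problem presumably lies elsewhere (for example in how the NFS server exports the directory).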

