On 12/5/19 4:46 PM, Loris Bennett wrote:
> Åke Sandgren <[email protected]> writes:
>
>> On 12/5/19 11:40 AM, Loris Bennett wrote:
>>> I have tried this with
>>>
>>> #!/bin/bash
>>>
>>> #SBATCH --job-name=easybuild_gpu
>>> #SBATCH --ntasks=4
>>> #SBATCH --time=12:00:00
>>> #SBATCH --mem-per-cpu=1G
>>> #SBATCH --partition=gpu
>>> #SBATCH --qos=medium
>>>
>>> srun eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
>>
>> Drop the srun part. You don't want to start 4 eb's doing the same thing.
>> That may be the reason for your error.
>
> Doesn't Easybuild use the number of cores available for parallel make?
Yes, but srun starts --ntasks instances of its argument, in this case
eb. That is something you only want for MPI programs. eb itself is just
a piece of serial Python code; the parallelism comes from the make it
launches.
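
A minimal sketch of the job script without srun, assuming a single task
with four cores is what you want. The --ntasks=1 / --cpus-per-task=4
split and the explicit --parallel setting are assumptions on my part;
the remaining options are copied from your script.

```shell
#!/bin/bash

#SBATCH --job-name=easybuild_gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=gpu
#SBATCH --qos=medium

# Run one eb process directly (no srun), so only a single build starts.
# Passing the allocated core count explicitly is an assumption; the
# fallback of 1 only applies when the script is run outside Slurm.
eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot \
    --parallel="${SLURM_CPUS_PER_TASK:-1}"
```

Keeping everything in one task also keeps all four cores on the same
node, which is what a parallel make needs.
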
>>> but get the error
>>>
>>> == FAILED: Installation ended unsuccessfully (build directory:
>>> /trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2):
>>> build failed (first 300 chars): Failed to chmod/chown several paths:
>>> ['/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2',
>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/protobufpython',
>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/abslpy
>>> (took 4 sec)
>>>
>>> I'm running the Slurm job as the same user I always use to run
>>> Easybuild, so all the above directories are already owned by that user.
>>>
>>> Any ideas about what I might be doing wrong?
>>
>> You need to look in the log file to see what the actual error is. The
>> summary just tells you that something went wrong.
>
> The actual error is
>
> last error: [Errno 30] Read-only file system:
>
> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/TensorFlow/tensorflow-1.13.1/tools/python_bin_path.sh'
>
> Indeed the NFS directory containing Easybuild and all the software was
> mounted read-only on the compute nodes.
>
> So I remounted read-write, but I still get the same error :-/
Did you make sure it got remounted RW on ALL nodes?
If you still get the same read-only file system error, the node the job
ran on probably still has it mounted RO.
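
One way to check this is to try to create a file in the directory from
each node. The following is only a sketch: check_writable is a
hypothetical helper name, and it cannot tell a read-only mount apart
from a plain permission problem; it only reports whether a write
succeeds.

```shell
#!/bin/bash

# Report whether a directory is writable from the current host by
# creating and then removing a temporary probe file in it.
# Returns 0 if the write succeeded, 1 otherwise.
check_writable() {
    local dir="$1" probe
    if probe=$(mktemp "$dir/.rw_probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "$(hostname): $dir is writable"
    else
        echo "$(hostname): $dir is NOT writable"
        return 1
    fi
}
```

Run as the build user on every compute node (for example via srun with
one task per node, or a parallel shell), with
/trinity/shared/easybuild/build as the argument, this should show which
nodes still fail to write.
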
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected] Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se