Re: [easybuild] Specifying partition for job-backend 'Slurm'

Åke Sandgren Thu, 05 Dec 2019 07:59:41 -0800

On 12/5/19 4:46 PM, Loris Bennett wrote:
> Åke Sandgren <[email protected]> writes:
> 
>> On 12/5/19 11:40 AM, Loris Bennett wrote:
>>> I have tried this with 
>>>
>>>   #!/bin/bash
>>>
>>>   #SBATCH --job-name=easybuild_gpu
>>>   #SBATCH --ntasks=4
>>>   #SBATCH --time=12:00:00
>>>   #SBATCH --mem-per-cpu=1G
>>>   #SBATCH --partition=gpu
>>>   #SBATCH --qos=medium
>>>
>>>   srun eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
>>
>> Drop the srun part. You don't want to start 4 eb's doing the same thing.
>> That may be the reason for your error.
> 
> Doesn't Easybuild use the number of cores available for parallel make? 

Yes, but srun starts --ntasks instances of its argument, in this case
eb. That is something you only want for MPI programs.

eb in itself is just a piece of serial Python code.
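
For illustration, a minimal sketch of the same job script without srun.
The file name easybuild_gpu.sh and the switch from --ntasks=4 to
--ntasks=1 with --cpus-per-task=4 are assumptions, not something from
the original script; the idea is to ask for four cores on one node for
a single eb process. The remaining directives are copied unchanged.

```shell
# Write the job script to a file (easybuild_gpu.sh is a hypothetical name).
cat > easybuild_gpu.sh <<'EOF'
#!/bin/bash

#SBATCH --job-name=easybuild_gpu
# Assumption: one task with four CPUs instead of four tasks.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=gpu
#SBATCH --qos=medium

# No srun here: eb is started exactly once by the batch script itself.
eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
EOF
```

The script would then be submitted with "sbatch easybuild_gpu.sh".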

>>> but get the error
>>>
>>>   == FAILED: Installation ended unsuccessfully (build directory:
>>> /trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2):
>>> build failed (first 300 chars): Failed to chmod/chown several paths:
>>> ['/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2',
>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/protobufpython',
>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/abslpy
>>> (took 4 sec)
>>>
>>> I'm running the Slurm job as the same user I always use to run
>>> Easybuild, so all the above directories are already owned by that user.
>>>
>>> Any ideas about what I might be doing wrong?
>>
>> You need to look in the log file to see what the actual error is; the
>> summary just tells you something went wrong.
> 
> The actual error is 
> 
>   last error: [Errno 30] Read-only file system:
>   
> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/TensorFlow/tensorflow-1.13.1/tools/python_bin_path.sh'
> 
> Indeed the NFS directory containing Easybuild and all the software was
> mounted read-only on the compute nodes.
> 
> So I remounted read-write, but I still get the same error :-/

Did you make sure it got remounted RW on ALL nodes?
If you still get the same read-only file system error, at least one of
them probably still has it mounted RO.
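
One way to check this is to try creating a file in the build directory
from each node. The following is only a sketch: check_writable is a
hypothetical helper name, and it reports on the host it runs on, so it
would have to be run once per compute node (for example from a job that
places one task on each node).

```shell
#!/bin/bash

# Try to create and remove a probe file in the given directory and
# report the result together with the name of the current host.
check_writable() {
    local dir="$1"
    local probe
    if probe=$(mktemp "${dir}/.rw_probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "$(uname -n): ${dir} is writable"
    else
        echo "$(uname -n): ${dir} is NOT writable"
        return 1
    fi
}
```

On a compute node it could be called as
"check_writable /trinity/shared/easybuild/build"; a node that still has
the file system mounted read-only should report "NOT writable".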

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [email protected]   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se