Re: [easybuild] Specifying partition for job-backend 'Slurm'

Loris Bennett Thu, 05 Dec 2019 08:07:27 -0800

Åke Sandgren <[email protected]> writes:

> On 12/5/19 4:46 PM, Loris Bennett wrote:
>> Åke Sandgren <[email protected]> writes:
>> 
>>> On 12/5/19 11:40 AM, Loris Bennett wrote:
>>>> I have tried this with 
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   #SBATCH --job-name=easybuild_gpu
>>>>   #SBATCH --ntasks=4
>>>>   #SBATCH --time=12:00:00
>>>>   #SBATCH --mem-per-cpu=1G
>>>>   #SBATCH --partition=gpu
>>>>   #SBATCH --qos=medium
>>>>
>>>>   srun eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
>>>
>>> Drop the srun part. You don't want to start 4 eb's doing the same thing.
>>> That may be the reason for your error.
>> 
>> Doesn't EasyBuild use the number of cores available for parallel make?
>
> Yes, but srun starts --ntasks instances of its argument, in this case
> eb. That is something you should only do for MPI programs.
>
> eb in itself is just a piece of serial Python code.


Got it.
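
For the archive, a minimal sketch of the job script with srun dropped, so
that eb is started once by the batch step. Replacing --ntasks=4 with
--ntasks=1 plus --cpus-per-task=4 is an assumption on my part (it asks for
the four cores on a single node); it is not something confirmed in this
thread.

```shell
#!/bin/bash

#SBATCH --job-name=easybuild_gpu
# Assumption: one task with four cores on one node, instead of four tasks.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=gpu
#SBATCH --qos=medium

# No srun here: the batch step runs a single eb process.
eb Keras-2.2.4-fosscuda-2019a-Python-3.7.2.eb --robot
```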

>>>> but get the error
>>>>
>>>>   == FAILED: Installation ended unsuccessfully (build directory:
>>>> /trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2):
>>>> build failed (first 300 chars): Failed to chmod/chown several paths:
>>>> ['/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2',
>>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/protobufpython',
>>>> '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/abslpy
>>>> (took 4 sec)
>>>>
>>>> I'm running the Slurm job as the same user I always use to run
>>>> EasyBuild, so all the above directories are already owned by that user.
>>>>
>>>> Any ideas about what I might be doing wrong?
>>>
>>> You need to look in the log file to see what the actual error is, the
>>> summary just tells you something went wrong.
>> 
>> The actual error is 
>> 
>>   last error: [Errno 30] Read-only file system:
>>   '/trinity/shared/easybuild/build/TensorFlow/1.13.1/fosscuda-2019a-Python-3.7.2/TensorFlow/tensorflow-1.13.1/tools/python_bin_path.sh'
>> 
>> Indeed, the NFS directory containing EasyBuild and all the software was
>> mounted read-only on the compute nodes.
>> 
>> So I remounted read-write, but I still get the same error :-/
>
> Did you make sure it got remounted RW on ALL nodes?
> After all, you are still getting the same RO file system error...

I just remounted on one node and used

  #SBATCH --nodelist=g011

I then tried running eb directly on the node without Slurm - same
story.  Maybe I'm just doing something stupid because it's late.  I'll
try again tomorrow.
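
A rough sketch of how the state of the mount could be checked on a node
before retrying. The function names mount_opts_readonly and check_writable
are made up for illustration, and the script assumes findmnt (util-linux)
is installed. Besides looking at the mount options, it tries to create a
file, since that is what the build actually needs to do.

```shell
#!/bin/bash

# Return 0 if a comma-separated mount option string contains the "ro" flag.
# Wrapping in commas avoids false hits on options such as errors=remount-ro.
mount_opts_readonly() {
    local opts=",$1,"
    [[ "$opts" == *",ro,"* ]]
}

# Report whether the filesystem holding a directory is mounted read-only
# and whether the current user can create a file in that directory.
check_writable() {
    local path="$1" opts probe
    # If findmnt is missing or fails, fall through to the write probe.
    opts=$(findmnt -no OPTIONS --target "$path" 2>/dev/null) || opts=""
    if mount_opts_readonly "$opts"; then
        echo "$path: mounted read-only ($opts)"
        return 1
    fi
    if probe=$(mktemp "$path/.rw_probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "$path: writable"
        return 0
    fi
    echo "$path: not writable"
    return 1
}

# Example call (path taken from the error message above):
#   check_writable /trinity/shared/easybuild/build
```

Run on the node in question (here g011), this shows whether the remount
took effect there.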

Thanks for the help,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
