Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled

2023-09-06 Thread Loris Bennett
Loris Bennett  writes:

> Hi Kenneth,
>
> Kenneth Hoste  writes:
>
>> Hi Loris,
>>
>> On 29/08/2023 08:19, Loris Bennett wrote:
>>> Hi,
>>> When I try to install 'dorado' via
>>>dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>> the tests stall at some point.  The process tree is as follows:
>>>├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>>│ └─ /usr/bin/python3.6 -m easybuild.main
>>> dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot
>>> --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm
>>> --tmpdir=/scratch/eb-bu
>>>│ └─ /bin/bash -c export
>>> PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH
>>> && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>>│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> run_test.py --continue-through-error --verbose -x
>>> distributed/elastic/utils/distrib
>>>│ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v --subprocess
>>>│ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│   │  │  ...
>>>│ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>>> from multiprocessing.resource_tracker import main;main(30)
>>>│ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│ │ │ └─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>│ │ └─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v --subprocess
>>>│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> run_test.py --continue-through-error --verbose -x
>>> distributed/elastic/utils/dist
>>> The problem seems to be that the following process hangs while
>>> calling 'read':
>>>Trace of process 95404 -
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>>> from multiprocessing.resource_tracker import main;main(30)
>>>strace: Process 95404 attached
>>>read(30,
>>> I have tried this twice and both times the installation has stopped
>>> like
>>> this, so I assume it is not some temporary issue with the file system.
>>> Does anyone have any ideas about what else I could look at?
>>
>>
>> Not sure at first sight, but maybe it's similar to a problem I ran
>> into with scipy recently, which boiled down to pytest-xdist getting
>> stuck when running in cgroups (for example when running from a Slurm
>> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>>
>> Doesn't look like it's exactly the same problem, but perhaps it gives
>> you a push in the right direction...
>
> Thanks for the tip.  However, even with --parallel=1 the test hangs at
> the same place - it just takes longer to get there :-)
>
> I guess I'll just try skipping the test for the moment.

Dorado depends on 

  PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb

so I tried to build this first, but ran into the same issue.

Even with --skip-test-cases, the build seems to have stopped.  The
process tree looks like:

  ├─ /bin/bash /var/spool/slurmd/job14436854/slurm_script
  │  └─ /usr/bin/python3.6 -m easybuild.main 
PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb --robot 
--cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm 
--tmpdir=/scratch/eb-build --skip-test-cases
  │ └─ /bin/bash -c export 
PYTHONPATH=/scratch/eb-build/eb-90y_n4w9/tmp4papiynj/lib/python3.10/site-packages:$PYTHONPATH
 &&  cd test && PYTHONUNBUFFERED=1 
/trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error  --verbose -x 
distributed/elastic/utils/distributed_test 
distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn 
test_optim test_model_dump distributed/fsdp/test_fsdp_memory 
distributed/fsdp/test_fsdp_overlap
  │└─ 
/trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error --verbose -x 
distributed/elastic/utils/distributed_test 
distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn 
test_optim test_model_dump distributed/fsdp/test_fsdp_memory 
distributed/fsdp/test_fsdp_overlap
  │   ├─ 
/trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v 
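
Something like the following ad-hoc script (nothing EasyBuild-specific, it just
walks /proc) might help to see which of these children everything is actually
blocked on; pass it the PID of the top run_test.py process (the script name and
output format are just my own invention):

  import os
  import sys

  def proc_children_map():
      """Map parent PID -> list of child PIDs, built from /proc/*/stat."""
      children = {}
      for entry in os.listdir("/proc"):
          if not entry.isdigit():
              continue
          try:
              with open("/proc/%s/stat" % entry) as fh:
                  stat = fh.read()
          except OSError:
              continue
          # /proc/<pid>/stat is "pid (comm) state ppid ..."; the comm field may
          # contain spaces, so split after the closing parenthesis.
          ppid = int(stat.rsplit(")", 1)[1].split()[1])
          children.setdefault(ppid, []).append(int(entry))
      return children

  def dump_tree(pid, children, indent=0):
      """Print PID, state, kernel wait channel and (truncated) command line."""
      try:
          with open("/proc/%d/cmdline" % pid) as fh:
              cmd = fh.read().replace("\0", " ").strip()
          with open("/proc/%d/stat" % pid) as fh:
              state = fh.read().rsplit(")", 1)[1].split()[0]
          with open("/proc/%d/wchan" % pid) as fh:
              wchan = fh.read().strip()
      except OSError:
          return
      print("%s%d [%s] %s: %s" % (" " * indent, pid, state, wchan, cmd[:100]))
      for child in sorted(children.get(pid, [])):
          dump_tree(child, children, indent + 2)

  if __name__ == "__main__":
      dump_tree(int(sys.argv[1]), proc_children_map())

(Save it as e.g. dump_tree.py and run it with the run_test.py PID as its only
argument.)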

Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled

2023-08-31 Thread Loris Bennett
Hi Kenneth,

Kenneth Hoste  writes:

> Hi Loris,
>
> On 29/08/2023 08:19, Loris Bennett wrote:
>> Hi,
>> When I try to install 'dorado' via
>>dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>> the tests stall at some point.  The process tree is as follows:
>>├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>│ └─ /usr/bin/python3.6 -m easybuild.main
>> dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot
>> --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm
>> --tmpdir=/scratch/eb-bu
>>│ └─ /bin/bash -c export
>> PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH
>> && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> run_test.py --continue-through-error --verbose -x
>> distributed/elastic/utils/distrib
>>│ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v --subprocess
>>│ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│   │  │  ...
>>│ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>> from multiprocessing.resource_tracker import main;main(30)
>>│ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│ │ │ └─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>│ │ └─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v --subprocess
>>│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> run_test.py --continue-through-error --verbose -x
>> distributed/elastic/utils/dist
>> The problem seems to be that the following process hangs while
>> calling 'read':
>>Trace of process 95404 -
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>> from multiprocessing.resource_tracker import main;main(30)
>>strace: Process 95404 attached
>>read(30,
>> I have tried this twice and both times the installation has stopped
>> like
>> this, so I assume it is not some temporary issue with the file system.
>> Does anyone have any ideas about what else I could look at?
>
>
> Not sure at first sight, but maybe it's similar to a problem I ran
> into with scipy recently, which boiled down to pytest-xdist getting
> stuck when running in cgroups (for example when running from a Slurm
> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>
> Doesn't look like it's exactly the same problem, but perhaps it gives
> you a push in the right direction...

Thanks for the tip.  However, even with --parallel=1 the test hangs at
the same place - it just takes longer to get there :-)

I guess I'll just try skipping the test for the moment.
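
If it comes to that, one option might be to add the hanging test to whatever
ends up in run_test.py's -x list, e.g. via the excluded_tests parameter of the
PyTorch easyblock (rough sketch only; I haven't checked that this parameter
name and layout are exactly what this easyconfig expects):

  # Unverified sketch: extend the tests excluded by the PyTorch easyblock
  # (the empty-string key is assumed to mean "all architectures").
  excluded_tests = {
      '': [
          'distributed/rpc/test_share_memory',
          # ... plus whatever the stock easyconfig already excludes
      ],
  }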

Cheers,

Loris

> regards,
>
> Kenneth
>
>> Cheers,
>> Loris
>> 
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin


Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled

2023-08-29 Thread Kenneth Hoste

Hi Loris,

On 29/08/2023 08:19, Loris Bennett wrote:

Hi,

When I try to install 'dorado' via

   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb

the tests stall at some point.  The process tree is as follows:

   ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
   │  └─ /usr/bin/python3.6 -m easybuild.main 
dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot 
--cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
   │ └─ /bin/bash -c export 
PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH 
&&  cd test && PYTHONUNBUFFERED=1 /trinity/shar
   │└─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error --verbose -x 
distributed/elastic/utils/distrib
   │   ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v --subprocess
   │   │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  │  ...
   │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from 
multiprocessing.resource_tracker import main;main(30)
   │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  │  └─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │   │  └─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v --subprocess
   │   └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist

The problem seems to be that the following process hangs while
calling 'read':

   Trace of process 95404 - 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from 
multiprocessing.resource_tracker import main;main(30)
   strace: Process 95404 attached
   read(30,

I have tried this twice and both times the installation has stopped like
this, so I assume it is not some temporary issue with the file system.

Does anyone have any ideas about what else I could look at?



Not sure at first sight, but maybe it's similar to a problem I ran into 
with scipy recently, which boiled down to pytest-xdist getting stuck
when running in cgroups (for example when running from a Slurm job), see 
https://github.com/pytest-dev/pytest-xdist/issues/658 .


Doesn't look like it's exactly the same problem, but perhaps it gives 
you a push in the right direction...
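
In case it helps, a quick generic check (not specific to pytest-xdist) for
whether the job's cgroup is restricting the CPUs that Python, and hence the
test runner, thinks it has:

  # Run inside the Slurm job: compare the total CPU count of the node with
  # the CPUs actually available to this process (its affinity mask / cgroup).
  import os

  print("os.cpu_count():            ", os.cpu_count())
  print("len(os.sched_getaffinity):", len(os.sched_getaffinity(0)))

If those two numbers differ a lot, anything that sizes a worker pool from
os.cpu_count() may oversubscribe the job.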



regards,

Kenneth



Cheers,

Loris

[easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled

2023-08-29 Thread Loris Bennett
Hi,

When I try to install 'dorado' via

  dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb

the tests stall at some point.  The process tree is as follows:

  ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
  │  └─ /usr/bin/python3.6 -m easybuild.main 
dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot 
--cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
  │ └─ /bin/bash -c export 
PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH
 &&  cd test && PYTHONUNBUFFERED=1 /trinity/shar
  │└─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error --verbose -x 
distributed/elastic/utils/distrib
  │   ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v --subprocess
  │   │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  │  ...
  │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from 
multiprocessing.resource_tracker import main;main(30)
  │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  │  ├─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  │  └─ 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │   │  └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
distributed/rpc/test_share_memory.py -v --subprocess
  │   └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python 
run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist

The problem seems to be that the following process hangs while
calling 'read':

  Trace of process 95404 - 
/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from 
multiprocessing.resource_tracker import main;main(30)
  strace: Process 95404 attached
  read(30,
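
To see what file descriptor 30 actually refers to, something along these lines
should work (PID and fd taken from the trace above):

  # Inspect the fd the resource_tracker process is blocked on; needs to be run
  # as the owner of the process (or root).
  import os

  pid, fd = 95404, 30
  print(os.readlink("/proc/%d/fd/%d" % (pid, fd)))   # e.g. 'pipe:[1234567]'
  with open("/proc/%d/wchan" % pid) as fh:
      print(fh.read())                               # kernel function it sleeps in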

I have tried this twice and both times the installation has stopped like
this, so I assume it is not some temporary issue with the file system.

Does anyone have any ideas about what else I could look at?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin