Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled
Loris Bennett writes:

> Hi Kenneth,
>
> Kenneth Hoste writes:
>
>> Hi Loris,
>>
>> On 29/08/2023 08:19, Loris Bennett wrote:
>>> Hi,
>>>
>>> When I try to install 'dorado' via
>>>
>>>   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>>
>>> the tests stall at some point. The process tree is as follows:
>>>
>>> ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>> │ └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
>>> │ └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
>>> │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>>> │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ │ ...
>>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>>> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
>>>
>>> The problem seems to be that the following process hangs while
>>> calling 'read':
>>>
>>> Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>>> strace: Process 95404 attached
>>> read(30,
>>>
>>> I have tried this twice and both times the installation has stopped
>>> like this, so I assume it is not some temporary issue with the file
>>> system.
>>>
>>> Does anyone have any ideas about what else I could look at?
>>
>> Not sure at first sight, but maybe it's similar to a problem I ran
>> into with scipy recently, which boiled down to pytest-xdist getting
>> stuck when running in cgroups (for example when running from a Slurm
>> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>>
>> Doesn't look like it's exactly the same problem, but perhaps it gives
>> you a push in the right direction...
>
> Thanks for the tip. However, even with --parallel=1 the test hangs at
> the same place - it just takes longer to get there :-)
>
> I guess I'll just try skipping the test for the moment.

Dorado depends on

  PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb

so I tried to build this first, but ran into the same issue. Even with
--skip-test-cases, the build seems to have stopped. The process tree
looks like:

├─ /bin/bash /var/spool/slurmd/job14436854/slurm_script
│ └─ /usr/bin/python3.6 -m easybuild.main PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build --skip-test-cases
│ └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-90y_n4w9/tmp4papiynj/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
│ └─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
│ ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v
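The list after '-x' in the run_test.py command above looks like the set of tests the easyconfig already excludes. A minimal sketch, assuming the PyTorch easyblock takes an 'excluded_tests' parameter as a dict whose '' key applies to all architectures, of how the hanging test might be added in a local copy of the easyconfig. The first seven entries are copied from the command line above; whether they match the shipped easyconfig exactly has not been checked here. It may also be worth noting that, as far as I know, --skip-test-cases only covers the separate "test cases" step, while run_test.py is started from the test step, for which EasyBuild has a --skip-test-step option.

```python
# Excerpt in easyconfig syntax (easyconfigs are plain Python).
# Assumption: the PyTorch easyblock reads 'excluded_tests' as a dict whose
# '' key applies to all architectures.
excluded_tests = {
    '': [
        # entries seen after '-x' in the process tree above
        'distributed/elastic/utils/distributed_test',
        'distributed/elastic/multiprocessing/api_test',
        'distributed/test_distributed_spawn',
        'test_optim',
        'test_model_dump',
        'distributed/fsdp/test_fsdp_memory',
        'distributed/fsdp/test_fsdp_overlap',
        # added for illustration: the test that hangs in this thread
        'distributed/rpc/test_share_memory',
    ]
}
```

Excluding a single test keeps the rest of the suite running, which is usually preferable to skipping the whole test step.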
Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled
Hi Kenneth,

Kenneth Hoste writes:

> Hi Loris,
>
> On 29/08/2023 08:19, Loris Bennett wrote:
>> Hi,
>>
>> When I try to install 'dorado' via
>>
>>   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>
>> the tests stall at some point. The process tree is as follows:
>>
>> ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>> │ └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
>> │ └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
>> │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>> │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ │ ...
>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
>>
>> The problem seems to be that the following process hangs while
>> calling 'read':
>>
>> Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>> strace: Process 95404 attached
>> read(30,
>>
>> I have tried this twice and both times the installation has stopped
>> like this, so I assume it is not some temporary issue with the file
>> system.
>>
>> Does anyone have any ideas about what else I could look at?
>
> Not sure at first sight, but maybe it's similar to a problem I ran
> into with scipy recently, which boiled down to pytest-xdist getting
> stuck when running in cgroups (for example when running from a Slurm
> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>
> Doesn't look like it's exactly the same problem, but perhaps it gives
> you a push in the right direction...

Thanks for the tip. However, even with --parallel=1 the test hangs at
the same place - it just takes longer to get there :-)

I guess I'll just try skipping the test for the moment.

Cheers,

Loris

> regards,
>
> Kenneth
>
>> Cheers,
>> Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
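On the cgroups hint: one thing that can differ inside a Slurm job is how many CPUs a process is actually allowed to use compared with how many the node reports. The thread does not establish that this is related to the hang, so the following is only a hedged diagnostic sketch to run inside the job; 'cpu_view' is a hypothetical helper name, not part of EasyBuild or PyTorch.

```python
import os


def cpu_view():
    """Return (CPUs reported for the node, CPUs this process may run on)."""
    total = os.cpu_count()  # may be None if it cannot be determined
    # sched_getaffinity is only available on some platforms (e.g. Linux);
    # fall back to the node-wide count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_view()
    print(f"node CPUs: {total}, usable by this process: {usable}")
```

If the two numbers differ a lot, test code that sizes its worker pool from the node-wide count may start more processes than the job can schedule.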
Re: [easybuild] dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb - tests stalled
Hi Loris,

On 29/08/2023 08:19, Loris Bennett wrote:
> Hi,
>
> When I try to install 'dorado' via
>
>   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>
> the tests stall at some point. The process tree is as follows:
>
> ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
> │ └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
> │ └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
> │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
> │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ │ ...
> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
> │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
> │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
>
> The problem seems to be that the following process hangs while
> calling 'read':
>
> Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
> strace: Process 95404 attached
> read(30,
>
> I have tried this twice and both times the installation has stopped
> like this, so I assume it is not some temporary issue with the file
> system.
>
> Does anyone have any ideas about what else I could look at?

Not sure at first sight, but maybe it's similar to a problem I ran
into with scipy recently, which boiled down to pytest-xdist getting
stuck when running in cgroups (for example when running from a Slurm
job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .

Doesn't look like it's exactly the same problem, but perhaps it gives
you a push in the right direction...

regards,

Kenneth

> Cheers,
> Loris
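A note on the strace output quoted above: in CPython, multiprocessing.resource_tracker.main(fd) reads from a pipe and only returns once it sees EOF, which happens when every process holding the write end has closed it. A tracker sitting in read(30, is therefore what one would expect while the test processes are still alive, so it may be a symptom rather than the cause. The sketch below, assuming Linux with /proc, shows how to check what a descriptor points at and demonstrates the blocking behaviour on an ordinary pipe; 'describe_fd' and 'read_until_eof' are hypothetical helper names.

```python
import os
import threading


def describe_fd(pid, fd):
    """Return what a file descriptor points at, e.g. 'pipe:[12345]' (Linux /proc)."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def read_until_eof(fd, out):
    """Collect lines from fd until every write end of the pipe is closed."""
    with os.fdopen(fd, "rb") as f:
        for line in f:
            out.append(line)


if __name__ == "__main__":
    r, w = os.pipe()
    print("read end is", describe_fd(os.getpid(), r))

    received = []
    reader = threading.Thread(target=read_until_eof, args=(r, received))
    reader.start()

    os.write(w, b"hello\n")
    reader.join(timeout=0.5)
    # The reader has consumed the line but is blocked waiting for more data.
    print("reader still blocked:", reader.is_alive())

    os.close(w)  # last write end closed, so the reader now sees EOF
    reader.join()
    print("reader finished, got:", received)
```

For the hung build, running describe_fd(95404, 30) and then looking for the same pipe inode under /proc/<pid>/fd of the sibling test processes would show which of them still holds the write end.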