Loris Bennett <[email protected]> writes:

> Hi Kenneth,
>
> Kenneth Hoste <[email protected]> writes:
>
>> Hi Loris,
>>
>> On 29/08/2023 08:19, Loris Bennett wrote:
>>> Hi,
>>>
>>> When I try to install 'dorado' via
>>>
>>>   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>>
>>> the tests stall at some point.  The process tree is as follows:
>>>
>>> ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>> │  └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
>>> │     └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>> │        └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
>>> │           ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>>> │           │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  │  ...
>>> │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>>> │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  │  └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>> │           │  └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>>> │           └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
>>>
>>> The problem seems to be that the following process hangs while
>>> calling 'read':
>>>
>>> Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>>> strace: Process 95404 attached
>>> read(30,
>>>
>>> I have tried this twice and both times the installation has stopped
>>> like this, so I assume it is not some temporary issue with the file
>>> system.
>>>
>>> Does anyone have any ideas about what else I could look at?
>>
>> Not sure at first sight, but maybe it's similar to a problem I ran
>> into with scipy recently, which boiled down to pytest-xdist getting
>> stuck when running in cgroups (for example when running from a Slurm
>> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>>
>> Doesn't look like it's exactly the same problem, but perhaps it gives
>> you a push in the right direction...
>
> Thanks for the tip.  However, even with --parallel=1 the test hangs at
> the same place - it just takes longer to get there :-)
>
> I guess I'll just try skipping the test for the moment.
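
On the question of what else to look at: the strace output above only shows that process 95404 is blocked in read() on file descriptor 30. A minimal sketch, assuming a Linux /proc filesystem and sufficient permissions to inspect the processes involved, of how one might find out what that descriptor is and which other processes hold the same object open. The helper names describe_fd and holders_of are hypothetical and not part of any tool mentioned in this thread. As far as I understand, the multiprocessing resource tracker normally sits in a blocking read on a pipe until every writer has closed it, so the processes holding the other end may be more informative than the tracker itself.

```python
import os


def describe_fd(pid, fd):
    """Return what /proc/<pid>/fd/<fd> points at, e.g. 'pipe:[12345]' (Linux only)."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def holders_of(target):
    """Return (pid, fd) pairs for every visible process that has `target` open."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access denied
        for name in names:
            try:
                if os.readlink(f"{fd_dir}/{name}") == target:
                    found.append((int(entry), int(name)))
            except OSError:
                continue  # descriptor closed while scanning
    return found


if __name__ == "__main__":
    # Demonstration on a pipe owned by this script; for the hang above one
    # would pass the PID and descriptor reported by strace instead.
    read_end, write_end = os.pipe()
    target = describe_fd(os.getpid(), read_end)
    print(target, holders_of(target))
```

Both ends of a pipe resolve to the same `pipe:[inode]` target, so the list returned by holders_of includes whichever processes still hold the write end.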
Dorado depends on PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb, so I tried to
build this first, but ran into the same issue.  Even with
--skip-test-cases, the build seems to have stopped.  The process tree
looks like:

├─ /bin/bash /var/spool/slurmd/job14436854/slurm_script
│  └─ /usr/bin/python3.6 -m easybuild.main PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build --skip-test-cases
│     └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-90y_n4w9/tmp4papiynj/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
│        └─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
│           ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
│           │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
...

Is there some other option I can use to prevent this test being run?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
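
The command line in the process tree already passes a list of test modules to run_test.py via -x, which appears to be how tests are excluded. The sketch below shows what extending that list with the hanging module might look like. It makes several assumptions that should be checked against the local installation: that run_test.py accepts distributed/rpc/test_share_memory as an excludable module name, that the PyTorch easyconfig feeds this list from an excluded_tests parameter, and that the hypothetical helper run_test_command resembles what the easyblock actually does. If the installed EasyBuild version offers a --skip-test-step option (see eb --help), that would be a coarser alternative that skips the whole test step.

```python
import shlex

# Test modules excluded in the command line shown in the process tree above.
ALREADY_EXCLUDED = [
    "distributed/elastic/utils/distributed_test",
    "distributed/elastic/multiprocessing/api_test",
    "distributed/test_distributed_spawn",
    "test_optim",
    "test_model_dump",
    "distributed/fsdp/test_fsdp_memory",
    "distributed/fsdp/test_fsdp_overlap",
]

# Assumption: the module name is the test's path without the .py suffix.
HANGING_TEST = "distributed/rpc/test_share_memory"


def run_test_command(python, excluded_tests):
    """Build a run_test.py command line that skips `excluded_tests` via -x."""
    cmd = [python, "run_test.py", "--continue-through-error"]
    if excluded_tests:
        cmd += ["--verbose", "-x", *excluded_tests]
    # Quote each part so the result can be pasted into a shell safely.
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(run_test_command("python", ALREADY_EXCLUDED + [HANGING_TEST]))
```

In an easyconfig the equivalent change would presumably be one extra entry in the list of excluded tests, again assuming the parameter exists under that name in the version in use.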

