Hi Loris,
On 29/08/2023 08:19, Loris Bennett wrote:
Hi,
When I try to install 'dorado' via
dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
the tests stall at some point. The process tree is as follows:
├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
│ └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
│ └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
│ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
│ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ │ ...
│ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
│ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
│ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
│ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
The problem seems to be that the following process hangs while
calling 'read':
Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
strace: Process 95404 attached
read(30,
I have tried this twice and both times the installation has stopped like
this, so I assume it is not some temporary issue with the file system.
Does anyone have any ideas about what else I could look at?
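One further thing that could be looked at is what file descriptor 30 of the blocked process actually refers to. The sketch below is only an illustration: the helper name describe_fd is made up here, and it assumes a Linux node with /proc mounted. It resolves the /proc/<pid>/fd/<fd> symlink, which shows whether the descriptor is a pipe, a socket or a regular file.

```python
import os


def describe_fd(pid, fd):
    """Return the target of /proc/<pid>/fd/<fd>, e.g. 'pipe:[12345]' (Linux only)."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


if __name__ == "__main__":
    # Demo on the current process: create a pipe and inspect its read end.
    read_end, write_end = os.pipe()
    print(describe_fd(os.getpid(), read_end))
    os.close(read_end)
    os.close(write_end)
    # For the process in the strace output above, the call would be
    # describe_fd(95404, 30), run as the same user or as root.
```

If the descriptor turns out to be a pipe, a read on it will block for as long as some other process still holds the write end open, so the blocked 'read' may be a symptom rather than the cause.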
Not sure at first sight, but maybe it's similar to a problem I ran into
with scipy recently, which boiled down to pytest-xdist getting stuck
when running in cgroups (for example when running from a Slurm job), see
https://github.com/pytest-dev/pytest-xdist/issues/658.
Doesn't look like it's exactly the same problem, but perhaps it gives
you a push in the right direction...
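To check whether the job is being restricted by its cgroup at all, a small sketch like the following could be run inside the same Slurm job. It is an assumption that the CPU restriction is relevant to this hang; the helper name cpu_counts is just for illustration. It compares the number of CPUs on the node with the number the process is actually allowed to run on.

```python
import os


def cpu_counts():
    """Return (logical CPUs on the node, CPUs this process may run on)."""
    total = os.cpu_count() or 1
    # os.sched_getaffinity is only available on some platforms (e.g. Linux);
    # fall back to the total count where it is missing.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"os.cpu_count():        {total}")
    print(f"usable (affinity set): {usable}")
    if usable < total:
        print("This process is restricted to a subset of the node's CPUs.")
```

If the two numbers differ, anything in the test suite that sizes its worker pool from os.cpu_count() would start more workers than the job has cores for, which is one place to start looking.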
regards,
Kenneth
Cheers,
Loris