Hi Loris,

On 29/08/2023 08:19, Loris Bennett wrote:
Hi,

When I try to install 'dorado' via

   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb

the tests stall at some point.  The process tree is as follows:

   ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
   │  └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
   │     └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /trinity/shar
   │        └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
   │           ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
   │           │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  │  ...
   │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
   │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  │  ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  │  └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
   │           │  └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
   │           └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist

The problem seems to be that the following process hangs while
calling 'read':

   Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
   strace: Process 95404 attached
   read(30,

I have tried this twice and both times the installation has stopped like
this, so I assume it is not some temporary issue with the file system.

Does anyone have any ideas about what else I could look at?


Not sure at first sight, but maybe it's similar to a problem I ran into with scipy recently, which boiled down to pytest-xdist getting stuck when running in cgroups (for example when running from a Slurm job); see https://github.com/pytest-dev/pytest-xdist/issues/658 .

Doesn't look like it's exactly the same problem, but perhaps it gives you a push in the right direction...
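
One way to narrow this down might be to check what file descriptor 30 of the hung process actually is, and which cgroup that process lives in. The sketch below is only an illustration: `describe_fd` and `read_cgroup` are hypothetical helper names (not part of EasyBuild or PyTorch), and it assumes a Linux `/proc` filesystem plus permission to inspect the target process.

```python
import os


def describe_fd(pid, fd):
    """Return what file descriptor `fd` of process `pid` points to (Linux /proc only)."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")


def read_cgroup(pid):
    """Return the cgroup membership lines of process `pid` (Linux /proc only)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return handle.read().splitlines()


if __name__ == "__main__":
    # Demonstrated on this process itself; for the hung build one would pass
    # the PID and descriptor from the strace output (95404 and 30) instead.
    read_end, write_end = os.pipe()
    print(describe_fd(os.getpid(), read_end))  # a pipe shows up as "pipe:[<inode>]"
    print(read_cgroup(os.getpid()))
    os.close(read_end)
    os.close(write_end)
```

If descriptor 30 turns out to be a pipe, the blocked 'read' may simply mean that the resource tracker is waiting for messages from the test processes, in which case those processes rather than the tracker would probably be the ones to look at.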


regards,

Kenneth


Cheers,

Loris

