Loris Bennett <[email protected]> writes:

> Hi Kenneth,
>
> Kenneth Hoste <[email protected]> writes:
>
>> Hi Loris,
>>
>> On 29/08/2023 08:19, Loris Bennett wrote:
>>> Hi,
>>> When I try to install 'dorado' via
>>>    dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>> the tests stall at some point.  The process tree is as follows:
>>>    ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>>    │ └─ /usr/bin/python3.6 -m easybuild.main
>>> dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot
>>> --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm
>>> --tmpdir=/scratch/eb-bu
>>>    │ └─ /bin/bash -c export
>>> PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH
>>> && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>>    │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> run_test.py --continue-through-error --verbose -x
>>> distributed/elastic/utils/distrib
>>>    │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v --subprocess
>>>    │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │           │  │  ...
>>>    │ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>>> from multiprocessing.resource_tracker import main;main(30)
>>>    │ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │ │ │ ├─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │ │ │ └─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>>    │ │ └─
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> distributed/rpc/test_share_memory.py -v --subprocess
>>>    │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>>> run_test.py --continue-through-error --verbose -x
>>> distributed/elastic/utils/dist
>>> The problem seems to be that the following process hangs while
>>> calling 'read':
>>>    Trace of process 95404 -
>>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>>> from multiprocessing.resource_tracker import main;main(30)
>>>    strace: Process 95404 attached
>>>    read(30,
>>> I have tried this twice and both times the installation has stopped
>>> like this, so I assume it is not some temporary issue with the file
>>> system.
>>> Does anyone have any ideas about what else I could look at?
>>
>>
>> Not sure at first sight, but maybe it's similar to a problem I ran
>> into with scipy recently, which boiled down to pytest-xdist getting
>> stuck when running in cgroups (for example when running from a Slurm
>> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>>
>> Doesn't look like it's exactly the same problem, but perhaps it gives
>> you a push in the right direction...
>
> Thanks for the tip.  However, even with --parallel=1 the test hangs at
> the same place - it just takes longer to get there :-)
>
> I guess I'll just try skipping the test for the moment.

Dorado depends on 

  PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb

so I tried to build this first, but ran into the same issue.

Even with --skip-test-cases, the build seems to have stopped.  The
process tree looks like:

  ├─ /bin/bash /var/spool/slurmd/job14436854/slurm_script
  │  └─ /usr/bin/python3.6 -m easybuild.main PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-build --skip-test-cases
  │     └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-90y_n4w9/tmp4papiynj/lib/python3.10/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
  │        └─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
  │           ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
  │           │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  │           │  │  ├─ /trinity/shared/easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
  ...
 
Is there some other option I can use to prevent this test being run?
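
A side note on the strace output quoted above: as far as I understand
it, main(fd) in multiprocessing.resource_tracker simply reads lines from
the pipe it is handed until it sees end-of-file.  A process sitting in
read(30, ...) may therefore only mean that some other process still
holds the write end of that pipe open, not that the tracker itself is
the culprit.  Below is a minimal sketch of that pipe behaviour.  It
assumes Linux and that fd 30 really is such a pipe; the function names
are made up for illustration and nothing here touches PyTorch or
EasyBuild.

```python
import os


def read_state(fd):
    """Classify a non-blocking read on fd as 'would block', 'eof' or 'data'."""
    try:
        chunk = os.read(fd, 1024)
    except BlockingIOError:
        # A blocking reader would hang here, like the read(30, ...) in strace.
        return "would block"
    return "eof" if chunk == b"" else "data"


def pipe_demo():
    """Show that a pipe reader only sees EOF once every write end is closed."""
    r, w = os.pipe()
    # Second copy of the write end, standing in for a descriptor that a
    # child process inherited and never closed (an assumption, not a diagnosis).
    w2 = os.dup(w)
    # Non-blocking so the demo reports the state instead of hanging.
    os.set_blocking(r, False)

    states = [read_state(r)]       # both write ends open
    os.close(w)
    states.append(read_state(r))   # one write end still open
    os.close(w2)
    states.append(read_state(r))   # no writers left
    os.close(r)
    return states


if __name__ == "__main__":
    print(pipe_demo())
```

If that reading is right, the more interesting question is which of the
TestRPCPickler.test_case processes is still alive and why, rather than
what the resource tracker is doing.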

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
