Hi Kenneth,

Kenneth Hoste <kenneth.ho...@ugent.be> writes:

> Hi Loris,
>
> On 29/08/2023 08:19, Loris Bennett wrote:
>> Hi,
>> When I try to install 'dorado' via
>>    dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>> the tests stall at some point.  The process tree is as follows:
>>    ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>>    │ └─ /usr/bin/python3.6 -m easybuild.main
>> dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot
>> --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm
>> --tmpdir=/scratch/eb-bu
>>    │ └─ /bin/bash -c export
>> PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH
>> && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>>    │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> run_test.py --continue-through-error --verbose -x
>> distributed/elastic/utils/distrib
>>    │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v --subprocess
>>    │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │           │  │  ...
>>    │ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>> from multiprocessing.resource_tracker import main;main(30)
>>    │ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │ │ │ ├─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │ │ │ └─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>>    │ │ └─
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> distributed/rpc/test_share_memory.py -v --subprocess
>>    │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python
>> run_test.py --continue-through-error --verbose -x
>> distributed/elastic/utils/dist
>> The problem seems to be that the following process hangs while
>> calling 'read':
>>    Trace of process 95404 -
>> /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c
>> from multiprocessing.resource_tracker import main;main(30)
>>    strace: Process 95404 attached
>>    read(30,
>> I have tried this twice and both times the installation has stopped
>> like this, so I assume it is not some temporary issue with the file
>> system.
>> Does anyone have any ideas about what else I could look at?
>
>
> Not sure at first sight, but maybe it's similar to a problem I ran
> into with scipy recently, which boiled down to pytest-xdist getting
> stuck when running in cgroups (for example when running from a Slurm
> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>
> Doesn't look like it's exactly the same problem, but perhaps it gives
> you a push in the right direction...

Thanks for the tip.  However, even with --parallel=1 the test hangs at
the same place - it just takes longer to get there :-)

I guess I'll just try skipping the test for the moment.
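
A side note on the strace output, offered as a sketch rather than a
diagnosis: as far as I understand it, the multiprocessing resource tracker
is handed the read end of a pipe (fd 30 in 'main(30)') and keeps reading
until every process holding the write end has closed it.  If that is right,
the blocked 'read(30,' may just mean that the test processes above it have
not exited, so it could be a symptom rather than the cause.  The snippet
below only illustrates that generic pipe behaviour; the helper name
'read_would_block' is made up for the example and nothing in it touches
PyTorch or EasyBuild.

```python
import os
import select


def read_would_block(fd, timeout=0.2):
    """Return True if a read on fd would block (no data and no EOF yet).

    Hypothetical helper for illustration only.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    return not readable


# A pipe stands in for the one the resource tracker reads from.
r, w = os.pipe()

# While some process still holds the write end open, a read just waits.
blocked_while_writer_open = read_would_block(r)

# Once the last write end is closed, the reader sees end-of-file.
os.close(w)
blocked_after_close = read_would_block(r)
data = os.read(r, 1024)  # returns immediately with an empty result at EOF
os.close(r)

print(blocked_while_writer_open, blocked_after_close, data)
```

So the more interesting processes to trace are probably the
'TestRPCPickler.test_case' children rather than the tracker itself.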

Cheers,

Loris

> regards,
>
> Kenneth
>
>> Cheers,
>> Loris
>> 
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
