Hi,

I have been working with Ray for a few days and got it running on our
cluster as follows:

mpiexec -n 64 Ray \
  -read-write-checkpoints \
  -o assembly \
  -p \
  /srv/scratch/z3382651/metagenomes/VZQ515A1-27724777/028-LFA_S1_L001_R1_001_val_1.fq \
  /srv/scratch/z3382651/metagenomes/VZQ515A1-27724777/028-LFA_S1_L001_R2_001_val_2.fq \
  ...

but it crashes without leaving an error message in the log (see
ray_1strun.log). The only thing resembling an error that I could find was
RAY_MPI_TAG_NOTIFY_ERROR, but I have no idea what it means.

I tried to resume it a few times. It often ended the same way; other
times it showed dirty buffer messages in the middle of the log; and on one
occasion it hit what I believe was a problem with communication between
the nodes and/or with MPI:

[kc07b05][[3753,1],51][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
[kc06b05][[3753,1],8][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc06b15][[3753,1],31][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
[kc06b13][[3753,1],17][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b04][[3753,1],42][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b05][[3753,1],53][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc06b15][[3753,1],28][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b05][[3753,1],49][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b05][[3753,1],48][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b04][[3753,1],46][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b05][[3753,1],50][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
[kc07b01][[3753,1],38][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc06b15][[3753,1],30][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc06b13][[3753,1],16][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc06b15][[3753,1],24][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b04][[3753,1],45][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b01][[3753,1],36][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b04][[3753,1],40][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b05][[3753,1],52][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc07b01][[3753,1],34][btl_tcp_endpoint.c:657:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.3.2.15 failed: Connection refused (111)
[kc06b15][[3753,1],25][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
[kc07b01][[3753,1],37][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b01][[3753,1],39][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
[kc07b01][[3753,1],35][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[kc07b01][[3753,1],33][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
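
In the excerpt above, every "Connection refused" message names the same address, 10.3.2.15. A rough way to tally the messages in a longer log might look like the Python sketch below. The function name tally_tcp_errors and the sample lines are only illustrative, and the script assumes nothing about the log beyond the two message formats shown above. It only counts messages; it does not explain why the connections failed.

```python
import re
from collections import Counter

# Matches the two TCP error messages that appear in the log excerpt.
ERROR_RE = re.compile(
    r"connect\(\) to (?P<ip>\d+\.\d+\.\d+\.\d+) failed: Connection refused"
    r"|readv failed: Connection reset by peer"
)

def tally_tcp_errors(lines):
    """Count each error kind and the addresses named in refused connections."""
    kinds = Counter()
    refused_targets = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # header lines such as [host][[job,1],rank][file:line:func]
        if match.group("ip"):
            kinds["connection refused"] += 1
            refused_targets[match.group("ip")] += 1
        else:
            kinds["connection reset"] += 1
    return kinds, refused_targets

# Three lines copied from the excerpt, used here as a small sample.
sample = [
    "connect() to 10.3.2.15 failed: Connection refused (111)",
    "mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)",
    "[kc07b05][[3753,1],51][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]",
]
kinds, refused_targets = tally_tcp_errors(sample)
print(kinds)
print(refused_targets)
```

To run it on a full log, the list could be replaced with the lines of the log file, for example open("ray_1strun.log").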

I don't know what could be wrong.

For reference, I'm using 8 nodes with 8 processes per node and a total vmem of
196 GB. Our cluster has more than 100 blades with a minimum of 12
cores/blade, most of them with 96 GB each. I include these details in case
the configuration of the job I submitted is part of the problem, or in case
it can be improved.

The job also includes several libraries in FASTQ format, which together
(uncompressed) add up to almost 150 GB.
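
To put those numbers side by side, here is a small back-of-the-envelope calculation in Python. It uses only the figures given above and assumes the 196 GB of vmem and the input data are spread evenly over the processes, which may not match how the scheduler or Ray actually distribute them. It says nothing about how much memory Ray really needs for this data set.

```python
# Figures from this message; the even split across processes is an assumption.
total_vmem_gb = 196
nodes = 8
procs_per_node = 8
input_gb = 150  # "almost 150 GB" of uncompressed FASTQ

ranks = nodes * procs_per_node             # total MPI processes
vmem_per_rank_gb = total_vmem_gb / ranks   # vmem available to each process
vmem_per_node_gb = total_vmem_gb / nodes   # vmem requested on each node
input_per_rank_gb = input_gb / ranks       # raw input per process

print(f"MPI processes:     {ranks}")
print(f"vmem per process:  {vmem_per_rank_gb:.2f} GB")
print(f"vmem per node:     {vmem_per_node_gb:.2f} GB")
print(f"input per process: {input_per_rank_gb:.2f} GB")
```

Under that assumption each process gets about 3 GB of vmem for roughly 2.3 GB of raw input.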

Thank you in advance,

Xabier




-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
------------------------------------------------------------------------------
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
