Hello,

Sounds strange to me.

The latency is correct, but a little high for InfiniBand. That should not 
matter much, though.

Example:

Rank 48: average latency for n00002 when requesting a reply for a message of 
4000 bytes is 299 microseconds (10^-6 seconds)


A correctly configured InfiniBand QDR fabric (QLogic or Mellanox) should 
report a value between 15 and 20 microseconds.
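If you want to track down which links are slow, the NetworkTest lines can be parsed mechanically. A small Python sketch (the line format is assumed from the excerpt above; adjust the regular expression if your Ray version words it differently):

```python
import re

# Flag ranks whose measured latency exceeds a threshold, from lines like:
#   Rank 48: average latency for n00002 when requesting a reply for a
#   message of 4000 bytes is 299 microseconds (10^-6 seconds)
# (format assumed from the log excerpt above)
LATENCY_RE = re.compile(
    r"Rank (\d+): average latency for (\S+) .* is (\d+) microseconds"
)

def slow_ranks(lines, threshold_us=20):
    """Return (rank, node, latency_us) triples above threshold_us."""
    hits = []
    for line in lines:
        m = LATENCY_RE.search(line)
        if m:
            rank, node, us = int(m.group(1)), m.group(2), int(m.group(3))
            if us > threshold_us:
                hits.append((rank, node, us))
    return hits

sample = ["Rank 48: average latency for n00002 when requesting a reply "
          "for a message of 4000 bytes is 299 microseconds (10^-6 seconds)"]
print(slow_ranks(sample))  # -> [(48, 'n00002', 299)]
```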

But again, that is not your problem.


All files exist and their sequences get counted correctly.

Rank 0: [1/34] /projects/olea/assembly/NGS/filteredReads/exp050_7_1.fastq -> 
partition is [0;11509421], 11509422 sequence reads
Rank 0: [2/34] /projects/olea/assembly/NGS/filteredReads/exp050_7_2.fastq -> 
partition is [11509422;23018843], 11509422 sequence reads
Rank 0: [3/34] /projects/olea/assembly/NGS/filteredReads/exp105_1_1.fastq -> 
partition is [23018844;63440438], 40421595 sequence reads
Rank 0: [4/34] /projects/olea/assembly/NGS/filteredReads/exp105_1_2.fastq -> 
partition is [63440439;103862033], 40421595 sequence reads
Rank 0: [5/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind1_1.fastq 
-> partition is [103862034;127351259], 23489226 sequence reads
Rank 0: [6/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind1_2.fastq 
-> partition is [127351260;150840485], 23489226 sequence reads
Rank 0: [7/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind2_1.fastq 
-> partition is [150840486;183768911], 32928426 sequence reads
Rank 0: [8/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind2_2.fastq 
-> partition is [183768912;216697337], 32928426 sequence reads
Rank 0: [9/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind1_1.fastq 
-> partition is [216697338;235303875], 18606538 sequence reads
Rank 0: [10/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind1_2.fastq 
-> partition is [235303876;253910413], 18606538 sequence reads
Rank 0: [11/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind2_1.fastq 
-> partition is [253910414;280830591], 26920178 sequence reads
Rank 0: [12/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind2_2.fastq 
-> partition is [280830592;307750769], 26920178 sequence reads
Rank 0: [13/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind2_1.fastq 
-> partition is [307750770;316667816], 8917047 sequence reads
Rank 0: [14/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind2_2.fastq 
-> partition is [316667817;325584863], 8917047 sequence reads
Rank 0: [15/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind3_1.fastq 
-> partition is [325584864;337795427], 12210564 sequence reads
Rank 0: [16/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind3_2.fastq 
-> partition is [337795428;350005991], 12210564 sequence reads
Rank 0: [17/34] /projects/olea/assembly/NGS/filteredReads/exp117_1_1.fastq -> 
partition is [350005992;381443684], 31437693 sequence reads
Rank 0: [18/34] /projects/olea/assembly/NGS/filteredReads/exp117_1_2.fastq -> 
partition is [381443685;412881377], 31437693 sequence reads
Rank 0: [19/34] /projects/olea/assembly/NGS/filteredReads/exp117_2_1.fastq -> 
partition is [412881378;444289639], 31408262 sequence reads
Rank 0: [20/34] /projects/olea/assembly/NGS/filteredReads/exp117_2_2.fastq -> 
partition is [444289640;475697901], 31408262 sequence reads
Rank 0: [21/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind1_1.fastq 
-> partition is [475697902;488606736], 12908835 sequence reads
Rank 0: [22/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind1_2.fastq 
-> partition is [488606737;501515571], 12908835 sequence reads
Rank 0: [23/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind2_1.fastq 
-> partition is [501515572;514875189], 13359618 sequence reads
Rank 0: [24/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind2_2.fastq 
-> partition is [514875190;528234807], 13359618 sequence reads
Rank 0: [25/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind1_1.fastq 
-> partition is [528234808;541341853], 13107046 sequence reads
Rank 0: [26/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind1_2.fastq 
-> partition is [541341854;554448899], 13107046 sequence reads
Rank 0: [27/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind2_1.fastq 
-> partition is [554448900;568062902], 13614003 sequence reads
Rank 0: [28/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind2_2.fastq 
-> partition is [568062903;581676905], 13614003 sequence reads
Rank 0: [29/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G0OG3UG01.sff 
-> partition is [581676906;582362443], 685538 sequence reads
Rank 0: [30/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G1CMX8Y01.sff 
-> partition is [582362444;582940397], 577954 sequence reads
Rank 0: [31/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G1CMX8Y02.sff 
-> partition is [582940398;583394190], 453793 sequence reads
Rank 0: [32/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GWBHXVK04.sff 
-> partition is [583394191;583521268], 127078 sequence reads
Rank 0: [33/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GXHOTPW01.sff 
-> partition is [583521269;584053495], 532227 sequence reads
Rank 0: [34/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GXUOK6A01.sff 
-> partition is [584053496;584586920], 533425 sequence reads
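As a sanity check, the partition bookkeeping above is internally consistent: for every file, end - start + 1 equals the reported read count, and each partition starts right after the previous one ends. A quick Python check on the first four entries from the log:

```python
# (start, end, reported read count), copied from the log above
partitions = [
    (0, 11509421, 11509422),
    (11509422, 23018843, 11509422),
    (23018844, 63440438, 40421595),
    (63440439, 103862033, 40421595),
]
prev_end = -1
for start, end, count in partitions:
    assert end - start + 1 == count  # count matches the range size
    assert start == prev_end + 1     # partitions are contiguous
    prev_end = end
print("partitions are consistent")   # prints "partitions are consistent"
```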


But then:

Out of 64 MPI ranks, only 56 completed the loading of their sequences.

The question is: what are the remaining 8 doing?

I understand you use 64 MPI ranks and InfiniBand. Is your setup something 
like 8 computers connected with InfiniBand, each with 8 Xeon cores? (Xeon 
Nehalem machines are popular!)


The 8 missing MPI ranks are the numbers 8-15. These are still loading sequences.

boisvert@oll-sboisvert:~/Desktop$ grep 00007 rayHybridOlea.o135758 |grep Proc
Rank 8: Rank= 8 Size= 64 ProcessIdentifier= 13261 ProcessorName= n00007
Rank 9: Rank= 9 Size= 64 ProcessIdentifier= 13262 ProcessorName= n00007
Rank 10: Rank= 10 Size= 64 ProcessIdentifier= 13263 ProcessorName= n00007
Rank 12: Rank= 12 Size= 64 ProcessIdentifier= 13265 ProcessorName= n00007
Rank 13: Rank= 13 Size= 64 ProcessIdentifier= 13266 ProcessorName= n00007
Rank 11: Rank= 11 Size= 64 ProcessIdentifier= 13264 ProcessorName= n00007
Rank 14: Rank= 14 Size= 64 ProcessIdentifier= 13267 ProcessorName= n00007
Rank 15: Rank= 15 Size= 64 ProcessIdentifier= 13268 ProcessorName= n00007

They are all on n00007.
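A quick way to count ranks per node from those lines (a small Python sketch; the "ProcessorName=" field is taken from the grep output above). With 8 cores per node, any node showing more than 8 ranks of your own job is oversubscribed by the job itself; 8 ranks but a loaded node points to another user.

```python
from collections import Counter
import re

# Sample lines in the format shown by the grep output above
lines = [
    "Rank 8: Rank= 8 Size= 64 ProcessIdentifier= 13261 ProcessorName= n00007",
    "Rank 9: Rank= 9 Size= 64 ProcessIdentifier= 13262 ProcessorName= n00007",
]

# Count how many MPI ranks landed on each node
per_node = Counter()
for line in lines:
    m = re.search(r"ProcessorName= (\S+)", line)
    if m:
        per_node[m.group(1)] += 1
print(dict(per_node))  # -> {'n00007': 2}
```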

Are the nodes shared with other users -- that is, is Ray running on n00007 
while programs from other users also run on n00007?

In my experience, this is the biggest source of MPI job failures.

My guess is that n00007 is paging because another user is oversubscribing 
it.


You can avoid that altogether by configuring your cluster scheduler to hand 
out whole nodes to a single job at a time.


How do you launch your job on the cluster?

With Sun Grid Engine?

If so, allocation_rule should be 8 (the number of cores per node), not 
$round_robin or $fill_up, as those rules let jobs from several users share 
and oversubscribe the same nodes, which causes a lot of failing jobs.
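For reference, an SGE parallel environment with a fixed allocation rule looks roughly like this (pe_name and slots below are placeholders; inspect your own with qconf -sp <pe_name>):

```
pe_name            mpi8
slots              80
allocation_rule    8      # always 8 slots per node -> whole nodes per job
control_slaves     TRUE
job_is_first_task  FALSE
```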

With Portable Batch System (PBS)?





                                                     Sébastien
________________________________________
From: Francesco Vezzi [ve...@appliedgenomics.org]
Sent: 23 August 2011 10:09
To: Sébastien Boisvert
Subject: Re: RE : Ray behavior

Dear Seb,
Attached you can find the standard error and standard output produced by 
Ray. I forgot to redirect them, so I was obliged to kill the application in 
order to take a look at them.

It seems that Ray was processing the reads.

Francesco

----- Original Message -----
From: "Sébastien Boisvert" <sebastien.boisver...@ulaval.ca>
To: "Francesco Vezzi" <ve...@appliedgenomics.org>
Cc: denovoassembler-us...@lists.sf.net
Sent: Tuesday, August 23, 2011 3:38:02 PM
Subject: RE : Ray behavior

Hello Francesco,

What is the content of the standard output and standard error?



> ________________________________________
> From: Francesco Vezzi [ve...@appliedgenomics.org]
> Sent: 23 August 2011 09:15
> To: Sébastien Boisvert
> Objet : Ray behavior
>
> Dear Sebastien
> I'm using Ray to assemble a high-coverage (~50X) olive genome dataset, 
> composed mainly of Illumina reads plus some unpaired 454 reads.
> I launched Ray a week ago on our cluster of 10 nodes with 8 cores each, 
> connected through InfiniBand. On a dataset 4 times smaller, using the 
> same hardware, I was able to obtain an assembly in 3 days with ABySS.
>
> The strange behavior is that after one week the folder where the program 
> is running contains only 3 files: RayOutput.NetworkTest.txt, 
> RayOutput.RayCommand.txt and RayOutput.RayVersion.txt.
> Moreover, while all the CPUs in use are running at 100%, the amount of 
> RAM used on the nodes is exceptionally low (less than 5 GB per node).
>
> In your experience, is this behavior normal, or is the program likely to 
> be stuck in some deadlock?
>
> Best
> Francesco
>
>
>

                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
