Hello, Sounds strange to me.
Latency is correct, but a little high for Infiniband. But that should not matter much. Example: Rank 48: average latency for n00002 when requesting a reply for a message of 4000 bytes is 299 microseconds (10^-6 seconds) Correctly configured Infiniband QDR (QLogic or Mellanox) should report a value between 15 and 20 microseconds. But again, that is not your problem. All files exist and their sequences get counted correctly. Rank 0: [1/34] /projects/olea/assembly/NGS/filteredReads/exp050_7_1.fastq -> partition is [0;11509421], 11509422 sequence reads Rank 0: [2/34] /projects/olea/assembly/NGS/filteredReads/exp050_7_2.fastq -> partition is [11509422;23018843], 11509422 sequence reads Rank 0: [3/34] /projects/olea/assembly/NGS/filteredReads/exp105_1_1.fastq -> partition is [23018844;63440438], 40421595 sequence reads Rank 0: [4/34] /projects/olea/assembly/NGS/filteredReads/exp105_1_2.fastq -> partition is [63440439;103862033], 40421595 sequence reads Rank 0: [5/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind1_1.fastq -> partition is [103862034;127351259], 23489226 sequence reads Rank 0: [6/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind1_2.fastq -> partition is [127351260;150840485], 23489226 sequence reads Rank 0: [7/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind2_1.fastq -> partition is [150840486;183768911], 32928426 sequence reads Rank 0: [8/34] /projects/olea/assembly/NGS/filteredReads/exp115_4_ind2_2.fastq -> partition is [183768912;216697337], 32928426 sequence reads Rank 0: [9/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind1_1.fastq -> partition is [216697338;235303875], 18606538 sequence reads Rank 0: [10/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind1_2.fastq -> partition is [235303876;253910413], 18606538 sequence reads Rank 0: [11/34] /projects/olea/assembly/NGS/filteredReads/exp115_5_ind2_1.fastq -> partition is [253910414;280830591], 26920178 sequence reads Rank 0: [12/34] 
/projects/olea/assembly/NGS/filteredReads/exp115_5_ind2_2.fastq -> partition is [280830592;307750769], 26920178 sequence reads Rank 0: [13/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind2_1.fastq -> partition is [307750770;316667816], 8917047 sequence reads Rank 0: [14/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind2_2.fastq -> partition is [316667817;325584863], 8917047 sequence reads Rank 0: [15/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind3_1.fastq -> partition is [325584864;337795427], 12210564 sequence reads Rank 0: [16/34] /projects/olea/assembly/NGS/filteredReads/exp115_8_ind3_2.fastq -> partition is [337795428;350005991], 12210564 sequence reads Rank 0: [17/34] /projects/olea/assembly/NGS/filteredReads/exp117_1_1.fastq -> partition is [350005992;381443684], 31437693 sequence reads Rank 0: [18/34] /projects/olea/assembly/NGS/filteredReads/exp117_1_2.fastq -> partition is [381443685;412881377], 31437693 sequence reads Rank 0: [19/34] /projects/olea/assembly/NGS/filteredReads/exp117_2_1.fastq -> partition is [412881378;444289639], 31408262 sequence reads Rank 0: [20/34] /projects/olea/assembly/NGS/filteredReads/exp117_2_2.fastq -> partition is [444289640;475697901], 31408262 sequence reads Rank 0: [21/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind1_1.fastq -> partition is [475697902;488606736], 12908835 sequence reads Rank 0: [22/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind1_2.fastq -> partition is [488606737;501515571], 12908835 sequence reads Rank 0: [23/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind2_1.fastq -> partition is [501515572;514875189], 13359618 sequence reads Rank 0: [24/34] /projects/olea/assembly/NGS/filteredReads/exp117_3_ind2_2.fastq -> partition is [514875190;528234807], 13359618 sequence reads Rank 0: [25/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind1_1.fastq -> partition is [528234808;541341853], 13107046 sequence reads Rank 0: [26/34] 
/projects/olea/assembly/NGS/filteredReads/exp117_4_ind1_2.fastq -> partition is [541341854;554448899], 13107046 sequence reads Rank 0: [27/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind2_1.fastq -> partition is [554448900;568062902], 13614003 sequence reads Rank 0: [28/34] /projects/olea/assembly/NGS/filteredReads/exp117_4_ind2_2.fastq -> partition is [568062903;581676905], 13614003 sequence reads Rank 0: [29/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G0OG3UG01.sff -> partition is [581676906;582362443], 685538 sequence reads Rank 0: [30/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G1CMX8Y01.sff -> partition is [582362444;582940397], 577954 sequence reads Rank 0: [31/34] /projects/olea/lanes/sequences/dna/454_SanMichele/G1CMX8Y02.sff -> partition is [582940398;583394190], 453793 sequence reads Rank 0: [32/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GWBHXVK04.sff -> partition is [583394191;583521268], 127078 sequence reads Rank 0: [33/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GXHOTPW01.sff -> partition is [583521269;584053495], 532227 sequence reads Rank 0: [34/34] /projects/olea/lanes/sequences/dna/454_SanMichele/GXUOK6A01.sff -> partition is [584053496;584586920], 533425 sequence reads But then Out of 64 MPI ranks, only 56 MPI ranks completed the loading of their sequences. Question is, what are doing the remaining 8 ? I understand you use 64 MPI ranks and Infiniband. Is your setup like 8 computers connected with Infiniband with each computer having 8 Xeon cores or something like that. (Xeon Nehalem are popular !) The 8 missing MPI ranks are the numbers 8-15. These are still loading sequences. 
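The "which ranks are missing" check above can be sketched in shell. The completion message text ("has finished loading sequences"), the log file name, and the 4-rank demo log are all assumptions for illustration; adjust the grep pattern to match what Ray actually prints in your output file.

```shell
#!/bin/sh
# Sketch: list MPI ranks that have not yet reported finishing their load.
LOG=/tmp/ray_demo.log

# Fake 4-rank log for demonstration: rank 2 never reports completion.
cat > "$LOG" <<'EOF'
Rank 0: has finished loading sequences
Rank 1: has finished loading sequences
Rank 3: has finished loading sequences
EOF

RANKS=4
for rank in $(seq 0 $((RANKS - 1))); do
    # A rank with no completion line is still loading (or stuck).
    grep -q "^Rank $rank: has finished loading sequences" "$LOG" \
        || echo "Rank $rank is still loading"
done
```

Running this prints `Rank 2 is still loading`; against a real 64-rank log, set RANKS=64 and point LOG at the job's output file.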
boisvert@oll-sboisvert:~/Desktop$ grep 00007 rayHybridOlea.o135758 | grep Proc
Rank 8: Rank= 8 Size= 64 ProcessIdentifier= 13261 ProcessorName= n00007
Rank 9: Rank= 9 Size= 64 ProcessIdentifier= 13262 ProcessorName= n00007
Rank 10: Rank= 10 Size= 64 ProcessIdentifier= 13263 ProcessorName= n00007
Rank 12: Rank= 12 Size= 64 ProcessIdentifier= 13265 ProcessorName= n00007
Rank 13: Rank= 13 Size= 64 ProcessIdentifier= 13266 ProcessorName= n00007
Rank 11: Rank= 11 Size= 64 ProcessIdentifier= 13264 ProcessorName= n00007
Rank 14: Rank= 14 Size= 64 ProcessIdentifier= 13267 ProcessorName= n00007
Rank 15: Rank= 15 Size= 64 ProcessIdentifier= 13268 ProcessorName= n00007

They are all on n00007. Are the nodes shared with other users -- that is, is Ray running on n00007 while programs from other users also run on n00007? In my experience, this is the biggest source of MPI job failures. My guess is that n00007 is paging because another user is oversubscribing it. You can avoid that altogether by configuring your cluster scheduler on a per-node basis.

How do you launch your job on the cluster? With Sun Grid Engine? If so, allocation_rule should be 8, not round_robin or fill_up, as those cause a lot of failing jobs when users oversubscribe the nodes. With Portable Batch System?

Sébastien

________________________________________
From: Francesco Vezzi [ve...@appliedgenomics.org]
Sent: August 23, 2011 10:09
To: Sébastien Boisvert
Subject: Re: RE : Ray behavior

Dear Seb,

Attached you can find the standard error and standard output produced by Ray. I forgot to redirect them, so I was obliged to kill the application in order to have a look at them. It seems that Ray was processing the reads.
Francesco

----- Original Message -----
From: "Sébastien Boisvert" <sebastien.boisver...@ulaval.ca>
To: "Francesco Vezzi" <ve...@appliedgenomics.org>
Cc: denovoassembler-us...@lists.sf.net
Sent: Tuesday, August 23, 2011 3:38:02 PM
Subject: RE : Ray behavior

Hello Francesco,

What is the content of the standard output and standard error?

> ________________________________________
> From: Francesco Vezzi [ve...@appliedgenomics.org]
> Sent: August 23, 2011 09:15
> To: Sébastien Boisvert
> Subject: Ray behavior
>
> Dear Sébastien,
> I'm using Ray to assemble a high-coverage (~50X) dataset of the olive genome, composed mainly of Illumina reads plus some unpaired 454 reads.
> I launched Ray a week ago on our cluster, which has 10 nodes with 8 cores each, connected through InfiniBand. On a dataset 4 times smaller, using the same hardware, I was able to obtain an assembly in 3 days with ABySS.
>
> The strange behavior is that after one week, the folder where the program is running contains only 3 files: RayOutput.NetworkTest.txt, RayOutput.RayCommand.txt and RayOutput.RayVersion.txt.
> Moreover, while all of the CPUs in use are running at 100%, the amount of RAM used on the nodes is exceptionally low (less than 5 GB per node).
>
> In your experience, is this behavior normal, or is the program likely stuck in some deadlock?
>
> Best,
> Francesco

Sébastien Boisvert
http://github.com/sebhtml/ray

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
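[Editor's note: the allocation_rule advice in the thread above can be sketched as a Sun Grid Engine parallel environment definition. The PE name "mpi8" and the slot count are hypothetical; on a real cluster you would inspect an existing PE with `qconf -sp <name>` and edit it with `qconf -mp <name>`. Setting allocation_rule to 8 makes the scheduler place exactly 8 slots of a job per node, so a job owns whole 8-core nodes instead of sharing them.]

```
# Hypothetical output of: qconf -sp mpi8
pe_name            mpi8
slots              80
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
```

A 64-rank job would then request it with something like `qsub -pe mpi8 64 job.sh`, which allocates 8 whole nodes.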