On 08/04/13 04:36 AM, Ino de Bruijn wrote:
> Perfect, that worked, thanks again!
>
> Quite a big difference, with k=31 and de Bruijn routing it took 9 hours,
> whereas k=41 without routing specified took 1:30h on a Cray XE6 system with
> 1024 cores and ~250M pairs.
>
Yes. The Cray XE6 is a really fast computer, and its Gemini interconnect provides low latency, which lets messages in Ray travel faster from one point to another.

> Best regards,
> Ino
>
> ----------------------------------------------------------------------
> From: ino.debru...@scilifelab.se
> To: sebastien.boisver...@ulaval.ca
> Date: Fri, 5 Apr 2013 12:32:42 +0200
> CC: denovoassembler-users@lists.sourceforge.net
> Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
>
> Alright, I'll try that then, thank you!
>
> > Date: Thu, 4 Apr 2013 20:23:13 -0400
> > From: sebastien.boisver...@ulaval.ca
> > To: ino.debru...@scilifelab.se
> > CC: denovoassembler-users@lists.sourceforge.net
> > Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
> >
> > I think you should run Ray with a different value for
> > -read-write-checkpoints just to rule out bad checkpoints. Maybe your checkpoints were
> > generated with a different number of ranks.
> >
> > On 04/04/13 04:26 AM, Ino de Bruijn wrote:
> > > Hi,
> > >
> > > Hypothesis 0. I have 249,559,758 pairs in total, so it would seem Ray is running on one MPI rank.
> > >
> > > Hypothesis 1. If I look at SequencePartition.txt, it seems the reads are properly partitioned:
> > >
> > > $ head -2 metassemble/assemblies/ray/out_41/SequencePartition.txt
> > > #Rank FirstSequence LastSequence NumberOfSequences
> > > 0 0 243709 243710
> > >
> > > $ tail -1 metassemble/assemblies/ray/out_41/SequencePartition.txt
> > > 1023 249315330 249559757 244428
> > >
> > > If I look at the log, it also finds checkpoints for all the different ranks. But maybe it uses the old routing
> > > I specified? Or is that not stored in the checkpoints?
> > >
> > > Best regards,
> > > Ino
> > >
> > > > Date: Wed, 3 Apr 2013 11:10:27 -0400
> > > > From: sebastien.boisver...@ulaval.ca
> > > > To: ino.debru...@scilifelab.se
> > > > CC: denovoassembler-users@lists.sourceforge.net
> > > > Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
> > > >
> > > > Hello,
> > > >
> > > > On 02/04/13 04:37 AM, Ino de Bruijn wrote:
> > > > > Dear Sébastien Boisvert,
> > > > >
> > > > > I tried a run without specifying the route-messages parameter but the problem remains the same.
> > > > >
> > > > > At the end only Rank 0 is outputting data:
> > > > >
> > > > > Speed RAY_SLAVE_MODE_ADD_EDGES 4066 units/second
> > > > > Estimated remaining time for this step: 12 hours, 36 minutes, 45 seconds
> > > > > Rank 0 is adding edges [64950001/249559758]
> > > >
> > > > 249559758 is the number of reads for MPI rank 0. And you have 1024 such MPI ranks.
> > > >
> > > > There are two alternative hypotheses:
> > > >
> > > > Hypothesis 0. You have 1024 * 249559758 = 255 549 192 192 reads, which is a lot!
> > > >
> > > > Hypothesis 1. Ray is actually running on 1 MPI rank.
> > > >
> > > > You can check the facts to validate one of the hypotheses with this command:
> > > >
> > > > cat out_41/SequencePartition.txt
> > > >
> > > > I think that the problem is that you are re-using old checkpoints that were not generated with
> > > > 1024 MPI ranks.
> > > >
> > > > > The command I used was:
> > > > >
> > > > > mpiexec -n 1024 Ray \
> > > > > -k \
> > > > > 41 \
> > > > > -i \
> > > > > metassemble/assemblies/ray/pair.fastq \
> > > > > -o \
> > > > > metassemble/assemblies/ray/out_41 \
> > > > > -read-write-checkpoints \
> > > > > metassemble/assemblies/ray/out_41.cp
> > > > >
> > > > > on a Cray XE6 system: http://www.pdc.kth.se/resources/computers/lindgren/hardware.
> > > > >
> > > > > Best regards,
> > > > > Ino
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > From: ino...@hotmail.com
> > > > > To: sebastien.boisver...@ulaval.ca
> > > > > CC: denovoassembler-users@lists.sourceforge.net
> > > > > Subject: RE: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
> > > > > Date: Tue, 26 Mar 2013 00:26:05 +0100
> > > > >
> > > > > Ah excellent! My knowledge of this kind of system is clearly lacking. Thanks a lot!
> > > > >
> > > > > Best regards,
> > > > > Ino
> > > > >
> > > > > > Date: Mon, 25 Mar 2013 17:44:12 -0400
> > > > > > From: sebastien.boisver...@ulaval.ca
> > > > > > To: ino...@hotmail.com
> > > > > > CC: denovoassembler-users@lists.sourceforge.net
> > > > > > Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
> > > > > >
> > > > > > On 25/03/13 12:46 PM, Ino de Bruijn wrote:
> > > > > > > Thanks a lot for the prompt reply!
> > > > > > >
> > > > > > > > What is your interconnect?
> > > > > > > >
> > > > > > > > Do you have Infiniband?
> > > > > > >
> > > > > > > It is a Cray XE6 system that uses the Cray Gemini interconnect technology. These are the full specs: http://www.pdc.kth.se/resources/computers/lindgren/hardware
> > > > > >
> > > > > > In that case, you don't need to route your messages!
> > > > > >
> > > > > > The Cray XE6 has the best interconnect out there! It's a 3D torus.
> > > > > >
> > > > > > > Is the polytope connection type good for this type of interconnect as well?
> > > > > >
> > > > > > You should remove the -route-messages option altogether.
> > > > > >
> > > > > > -route-messages is useful when using buggy Infiniband fabrics or TCP networks.
> > > > > > If you use something like:
> > > > > >
> > > > > > * Cray XE6
> > > > > > * IBM Blue Gene/Q
> > > > > > * Intel PSM (QLogic Infiniband)
> > > > > > * IBM iDataPlex
> > > > > >
> > > > > > you don't need this option because these systems are really good and provide low-latency any-to-any message passing.
> > > > > >
> > > > > > > Best regards,
> > > > > > > Ino
> > > > > > >
> > > > > > > > Date: Mon, 25 Mar 2013 11:24:21 -0400
> > > > > > > > From: sebastien.boisver...@ulaval.ca
> > > > > > > > To: denovoassembler-users@lists.sourceforge.net
> > > > > > > > Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at Rank 0
> > > > > > > >
> > > > > > > > On 25/03/13 05:32 AM, Ino de Bruijn wrote:
> > > > > > > > > Dear Sébastien Boisvert,
> > > > > > > > >
> > > > > > > > > I am trying to assemble a paired-end Illumina HiSeq library of about 1 billion reads. I ran Ray with:
> > > > > > > > >
> > > > > > > > > mpiexec -n 1024 Ray \
> > > > > > > > > -k \
> > > > > > > > > 31 \
> > > > > > > > > -i \
> > > > > > > > > metassemble/assemblies/ray/pair.fastq \
> > > > > > > > > -o \
> > > > > > > > > metassemble/assemblies/ray/out_31 \
> > > > > > > > > -read-write-checkpoints \
> > > > > > > > > metassemble/assemblies/ray/out_31.cp \
> > > > > > > > > -route-messages
> > > > > > > >
> > > > > > > > Without other arguments, -route-messages will use a de Bruijn graph for routing, which is not really good.
> > > > > > > >
> > > > > > > > What is your interconnect?
> > > > > > > >
> > > > > > > > Do you have Infiniband?
> > > > > > > >
> > > > > > > > Use this instead (the polytope is the best routing engine in RayPlatform):
> > > > > > > >
> > > > > > > > -route-messages -connection-type polytope -routing-graph-degree 62
> > > > > > > >
> > > > > > > > (from https://github.com/sebhtml/ray/blob/master/Documentation/Routing.txt )
> > > > > > > >
> > > > > > > > > For k=31 the assembly succeeds in ~9h on 1,024 cores. If I try higher values of k (i.e. {41..81..10}), the run
> > > > > > > > > is terminated by the scheduler after a day (max run time is one day). If I look at the stdout log, it seems
> > > > > > > > > like only Rank 0 is doing something at the end. Here are a couple of lines from the output:
> > > > > > > >
> > > > > > > > Well, you are using a de Bruijn graph for routing your messages. A de Bruijn graph is theoretically cool for routing messages,
> > > > > > > > but in practice it's very bad because it's not adaptive and it's just a pit containing so many choke points.
> > > > > > > >
> > > > > > > > From the manual https://github.com/sebhtml/ray/blob/master/MANUAL_PAGE.txt :
> > > > > > > >
> > > > > > > > -connection-type type
> > > > > > > > Sets the connection type for routes.
> > > > > > > > Accepted values are debruijn, hypercube, polytope, group, random, kautz and complete. Default is debruijn.
> > > > > > > >
> > > > > > > > > Rank 0 is counting k-mers in sequence reads [51200001/249559758]
> > > > > > > > > Speed RAY_SLAVE_MODE_ADD_VERTICES 4909 units/second
> > > > > > > > > Estimated remaining time for this step: 11 hours, 13 minutes, 27 seconds
> > > > > > > > >
> > > > > > > > > This keeps going while only Rank 0 is outputting. The final message says there are 30 minutes left for k=41. For 51 and 61 it is around 10-20h left and for k=71 and k=81 it is about an hour again.
> > > > > > > >
> > > > > > > > You should definitely use the polytope. It has no choke points, and routes are adaptive (i.e. messages between A and B will use several paths).
> > > > > > > >
> > > > > > > > > Does it only use Rank 0 at this step because this step can only be done by one core, or is the graph that Rank 0 contains highly complex or something?
> > > > > > > >
> > > > > > > > At 1024 MPI ranks, rank 0 is one of the hubs in a de Bruijn graph.
> > > > > > > >
> > > > > > > > > If I want to continue running Ray, can I resume the process by running the same parameters, but using only one core (-n 1)? Or
> > > > > > > > > should I use more cores?
> > > > > > > >
> > > > > > > > You have to re-launch Ray with the same command except the -o parameter.
> > > > > > > >
> > > > > > > > Example:
> > > > > > > >
> > > > > > > > mpiexec -n 1024 Ray \
> > > > > > > > -k \
> > > > > > > > 31 \
> > > > > > > > -i \
> > > > > > > > metassemble/assemblies/ray/pair.fastq \
> > > > > > > > -o \
> > > > > > > > metassemble/assemblies/ray/out_31 \
> > > > > > > > -read-write-checkpoints \
> > > > > > > > metassemble/assemblies/ray/out_31.cp \
> > > > > > > > -route-messages -connection-type polytope -routing-graph-degree 62
> > > > > > > >
> > > > > > > > > When is the checkpointing done?
> > > > > > > >
> > > > > > > > At each step.
> > > > > > > >
> > > > > > > > To see your checkpoint files:
> > > > > > > >
> > > > > > > > ls metassemble/assemblies/ray/out_31.cp | less
> > > > > > > >
> > > > > > > > > It seems like I have to remove the output dir to resume from a checkpoint. Is that correct?
> > > > > > > >
> > > > > > > > No. It's not necessary. You can instead provide a new output directory.
> > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Ino de Bruijn
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Denovoassembler-users mailing list
> > > > > > > > Denovoassembler-users@lists.sourceforge.net
> > > > > > > > https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
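
The diagnosis in this thread can be replayed with a short script. The following is a minimal Python sketch, not part of Ray: the helper names (parse_partition, overview, remaining) are hypothetical, the SequencePartition.txt layout is assumed to be exactly the four columns quoted above, ranks and sequences are assumed to be numbered from 0 without gaps, and the remaining-time estimate assumes a constant speed. It uses only the numbers quoted in the thread.

```python
from datetime import timedelta

# Rows quoted in the thread (head -2 and tail -1 of SequencePartition.txt).
PARTITION_EXCERPT = """\
#Rank FirstSequence LastSequence NumberOfSequences
0 0 243709 243710
1023 249315330 249559757 244428
"""


def parse_partition(text):
    """Return (rank, first, last, count) tuples, skipping the '#' header."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rank, first, last, count = (int(field) for field in line.split())
        if last - first + 1 != count:
            raise ValueError(f"inconsistent row for rank {rank}")
        rows.append((rank, first, last, count))
    return rows


def overview(rows):
    """Infer (number of ranks, total sequences) from the last row.

    Assumption: ranks and sequences are numbered from 0 without gaps.
    """
    last_rank, _, last_sequence, _ = rows[-1]
    return last_rank + 1, last_sequence + 1


def remaining(done, total, units_per_second):
    """Remaining time at a constant speed, rounded down to whole seconds."""
    return timedelta(seconds=(total - done) // units_per_second)


rows = parse_partition(PARTITION_EXCERPT)
ranks, total_sequences = overview(rows)
rank0_share = rows[0][3]

# Denominator of the log line "Rank 0 is adding edges [64950001/249559758]".
rank0_workload = 249559758

print("ranks in partition file:", ranks)
print("total sequences:", total_sequences)
print("rank 0 share in partition file:", rank0_share)
print("rank 0 is processing every sequence:", rank0_workload == total_sequences)
print("time left for rank 0 as logged:", remaining(64950001, rank0_workload, 4066))
print("time for rank 0's own share alone:", remaining(0, rank0_share, 4066))
```

If the partition file lists 1024 ranks while the progress counter of rank 0 equals the total number of sequences, the reads are partitioned correctly on disk but rank 0 is still working through all of them, which fits the explanation given in the thread that the checkpoints were not generated with 1024 MPI ranks.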