Perfect, that worked, thanks again!

Quite a big difference: with k=31 and de Bruijn routing it took 9 hours,
whereas k=41 without routing specified took 1 h 30 min on a Cray XE6 system
with 1024 cores and ~250M pairs.

Best regards,
Ino

From: ino.debru...@scilifelab.se
To: sebastien.boisver...@ulaval.ca
Date: Fri, 5 Apr 2013 12:32:42 +0200
CC: denovoassembler-users@lists.sourceforge.net
Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck at 
Rank 0




Alright, I'll try that then, thank you!

> Date: Thu, 4 Apr 2013 20:23:13 -0400
> From: sebastien.boisver...@ulaval.ca
> To: ino.debru...@scilifelab.se
> CC: denovoassembler-users@lists.sourceforge.net
> Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck 
> at Rank 0
> 
> I think you should run Ray with a different value for
> -read-write-checkpoints, just to rule out bad checkpoints. Maybe your
> checkpoints were generated with a different number of ranks.
> 
> On 04/04/13 04:26 AM, Ino de Bruijn wrote:
> > Hi,
> >
> > Hypothesis 0. I have 249,559,758 pairs in total so it would seem Ray is 
> > running on one MPI rank.
> >
> > Hypothesis 1. If I look at the SequencePartition.txt it seems the reads are 
> > properly partitioned:
> >
> > $ head -2 metassemble/assemblies/ray/out_41/SequencePartition.txt
> > #Rank   FirstSequence   LastSequence    NumberOfSequences
> > 0       0       243709  243710
> >
> > $ tail -1 metassemble/assemblies/ray/out_41/SequencePartition.txt
> > 1023    249315330       249559757       244428
> >
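The partition check above can be scripted. The following is a minimal sketch, assuming the file has the four tab-separated columns shown in the header line (#Rank, FirstSequence, LastSequence, NumberOfSequences); the function name check_partition is made up for illustration and is not part of Ray.

```shell
#!/bin/sh
# check_partition FILE EXPECTED_RANKS   (hypothetical helper, not part of Ray)
# Prints the number of ranks and the total number of sequences listed in a
# SequencePartition.txt-style file. Returns non-zero if the number of ranks
# differs from EXPECTED_RANKS.
check_partition() {
    awk -v expected="$2" '
        /^#/    { next }                      # skip the header line
        NF >= 4 { ranks++; total += $4 }      # column 4 = NumberOfSequences
        END {
            printf "ranks=%d total=%d\n", ranks, total
            exit (ranks == expected ? 0 : 1)
        }
    ' "$1"
}
```

For the run discussed here, one would expect something like "check_partition metassemble/assemblies/ray/out_41/SequencePartition.txt 1024" to report 1024 ranks and a total equal to the 249,559,758 pairs, if the reads really are spread over all ranks.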
> > If I look at the log it also finds checkpoints for all different ranks. But 
> > maybe it uses the old routing
> > I specified? Or is that not stored in the checkpoints?
> >
> > Best regards,
> > Ino
> >
> >  > Date: Wed, 3 Apr 2013 11:10:27 -0400
> >  > From: sebastien.boisver...@ulaval.ca
> >  > To: ino.debru...@scilifelab.se
> >  > CC: denovoassembler-users@lists.sourceforge.net
> >  > Subject: Re: [Denovoassembler-users] Long execution time, seems to be 
> > stuck at Rank 0
> >  >
> >  > Hello,
> >  >
> >  > On 02/04/13 04:37 AM, Ino de Bruijn wrote:
> >  > > Dear Sébastien Boisvert,
> >  > >
> >  > > I tried a run without specifying the route-messages parameter but the 
> > problem remains the same.
> >  > >
> >  > > At the end only Rank 0 is outputting data:
> >  > >
> >  > > Speed RAY_SLAVE_MODE_ADD_EDGES 4066 units/second
> >  > > Estimated remaining time for this step: 12 hours, 36 minutes, 45 
> > seconds
> >  > > Rank 0 is adding edges [64950001/249559758]
> >  >
> >  > 249559758 is the number of reads for MPI rank 0. And you have 1024 such 
> > MPI ranks.
> >  >
> >  > There are two alternative hypotheses:
> >  >
> >  > Hypothesis 0. You have 1024 * 249559758 = 255 549 192 192 reads, which 
> > is a lot !
> >  >
> >  > Hypothesis 1: Ray is actually running on 1 MPI rank.
> >  >
> >  >
> >  > You can check the facts to validate one of the hypotheses with this 
> > command:
> >  >
> >  >
> >  > cat out_41/SequencePartition.txt
> >  >
> >  >
> >  >
> >  > I think that the problem is that you are re-using old checkpoints that 
> > were not generated with
> >  > 1024 MPI ranks.
> >  >
> >  > >
> >  > > The command I used was:
> >  > >
> >  > > mpiexec -n 1024 Ray \
> >  > > -k \
> >  > > 41 \
> >  > > -i \
> >  > > metassemble/assemblies/ray/pair.fastq \
> >  > > -o \
> >  > > metassemble/assemblies/ray/out_41 \
> >  > > -read-write-checkpoints \
> >  > > metassemble/assemblies/ray/out_41.cp
> >  > >
> >  > > on a Cray XE6 system: 
> > http://www.pdc.kth.se/resources/computers/lindgren/hardware.
> >  > >
> >  > > Best regards,
> >  > > Ino
> >  > >
> >  > >
> >  > > ------------------------------------------------------------
> >  > > From: ino...@hotmail.com
> >  > > To: sebastien.boisver...@ulaval.ca
> >  > > CC: denovoassembler-users@lists.sourceforge.net
> >  > > Subject: RE: [Denovoassembler-users] Long execution time, seems to be 
> > stuck at Rank 0
> >  > > Date: Tue, 26 Mar 2013 00:26:05 +0100
> >  > >
> >  > >
> >  > > Ah excellent! My knowledge of this kind of system is clearly 
> > lacking. Thanks a lot!
> >  > >
> >  > > Best regards,
> >  > > Ino
> >  > >
> >  > > > Date: Mon, 25 Mar 2013 17:44:12 -0400
> >  > > > From: sebastien.boisver...@ulaval.ca
> >  > > > To: ino...@hotmail.com
> >  > > > CC: denovoassembler-users@lists.sourceforge.net
> >  > > > Subject: Re: [Denovoassembler-users] Long execution time, seems to 
> > be stuck at Rank 0
> >  > > >
> >  > > > On 25/03/13 12:46 PM, Ino de Bruijn wrote:
> >  > > > > Thanks a lot for the prompt reply!
> >  > > > >
> >  > > > > > What is your interconnect ?
> >  > > > > >
> >  > > > > > Do you have Infiniband ?
> >  > > > >
> >  > > > > It is a Cray XE6 system that uses the Cray Gemini interconnect 
> > technology. These are the full specs: 
> > http://www.pdc.kth.se/resources/computers/lindgren/hardware
> >  > > >
> >  > > > In that case, you don't need to route your messages !
> >  > > >
> >  > > > The Cray XE6 has the best interconnect out there ! It's a 3D torus.
> >  > > >
> >  > > > >
> >  > > > > Is the polytope connection type good for this type of interconnect 
> > as well?
> >  > > > >
> >  > > >
> >  > > > You should remove the -route-messages option altogether.
> >  > > >
> >  > > >
> >  > > > -route-messages is useful when using buggy Infiniband fabrics or TCP 
> > networks. If you use something like:
> >  > > >
> >  > > > * Cray XE6
> >  > > > * IBM Blue Gene/Q
> >  > > > * Intel PSM (QLogic Infiniband)
> >  > > > * IBM iDataPlex
> >  > > >
> >  > > >
> >  > > > you don't need this option because these systems are really good and 
> > provide low-latency any-to-any message passing.
> >  > > >
> >  > > >
> >  > > > > Best regards,
> >  > > > > Ino
> >  > > > >
> >  > > > > > Date: Mon, 25 Mar 2013 11:24:21 -0400
> >  > > > > > From: sebastien.boisver...@ulaval.ca
> >  > > > > > To: denovoassembler-users@lists.sourceforge.net
> >  > > > > > Subject: Re: [Denovoassembler-users] Long execution time, seems 
> > to be stuck at Rank 0
> >  > > > > >
> >  > > > > > On 25/03/13 05:32 AM, Ino de Bruijn wrote:
> >  > > > > > > Dear Sébastien Boisvert,
> >  > > > > > >
> >  > > > > > > I am trying to assemble a paired-end Illumina Hiseq library of 
> > about 1 billion reads. I ran Ray with:
> >  > > > > > >
> >  > > > > > > mpiexec -n 1024 Ray \
> >  > > > > > > -k \
> >  > > > > > > 31 \
> >  > > > > > > -i \
> >  > > > > > > metassemble/assemblies/ray/pair.fastq \
> >  > > > > > > -o \
> >  > > > > > > metassemble/assemblies/ray/out_31 \
> >  > > > > > > -read-write-checkpoints \
> >  > > > > > > metassemble/assemblies/ray/out_31.cp \
> >  > > > > > > -route-messages
> >  > > > > >
> >  > > > > > Without other arguments, -route-messages will use a de Bruijn 
> > graph for routing, which is not really good.
> >  > > > > >
> >  > > > > > What is your interconnect ?
> >  > > > > >
> >  > > > > > Do you have Infiniband ?
> >  > > > > >
> >  > > > > >
> >  > > > > > Use this instead (the polytope is the best routing engine in 
> > RayPlatform):
> >  > > > > >
> >  > > > > > -route-messages -connection-type polytope -routing-graph-degree 
> > 62
> >  > > > > >
> >  > > > > > (from 
> > https://github.com/sebhtml/ray/blob/master/Documentation/Routing.txt )
> >  > > > > >
> >  > > > > > >
> >  > > > > > > For k=31 the assembly succeeds in ~9h on 1,024 cores. If I try 
> > higher values of k (i.e. {41..81..10}), the run
> >  > > > > > > is exited by the scheduler after a day (max run time is one 
> > day). If I look at the log of the stdout it seems
> >  > > > > > > like only Rank 0 is doing something at the end. Here are a 
> > couple of lines from the output:
> >  > > > > > >
> >  > > > > >
> >  > > > > > Well, you are using a de Bruijn graph for routing your messages. 
> > A de Bruijn graph is theoretically cool for routing messages,
> >  > > > > > but in practice it's very bad because it's not adaptive and 
> > it's just a pit containing so many choke points.
> >  > > > > >
> >  > > > > > From the manual 
> > https://github.com/sebhtml/ray/blob/master/MANUAL_PAGE.txt :
> >  > > > > >
> >  > > > > > -connection-type type
> >  > > > > > Sets the connection type for routes.
> >  > > > > > Accepted values are debruijn, hypercube, polytope, group, 
> > random, kautz and complete. Default is debruijn.
> >  > > > > >
> >  > > > > >
> >  > > > > >
> >  > > > > > > Rank 0 is counting k-mers in sequence reads 
> > [51200001/249559758]
> >  > > > > > > Speed RAY_SLAVE_MODE_ADD_VERTICES 4909 units/second
> >  > > > > > > Estimated remaining time for this step: 11 hours, 13 minutes, 
> > 27 seconds
> >  > > > > > >
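The numbers in the log lines above are consistent with a simple estimate of (total - done) / speed. The sketch below reproduces that arithmetic; it is an assumption about how the estimate could be derived, not a description of Ray's code, and the function names are made up for illustration.

```shell
#!/bin/sh
# eta_seconds DONE TOTAL SPEED
# Remaining units divided by units per second (integer division).
eta_seconds() {
    echo $(( ($2 - $1) / $3 ))
}

# format_hms SECONDS
# Formats a number of seconds in the same style as the log line.
format_hms() {
    printf '%d hours, %d minutes, %d seconds\n' \
        $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

# Values taken from the log: [51200001/249559758] at 4909 units/second.
format_hms "$(eta_seconds 51200001 249559758 4909)"
# prints: 11 hours, 13 minutes, 27 seconds
```

Because the denominator is the full 249,559,758 sequences, the estimate only makes sense if rank 0 alone is working through all of them at that speed.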
> >  > > > > > > This keeps going while only Rank 0 is outputting. The final 
> > message says there are 30 minutes left for k=41. For 51 and 61 it is around 
> > 10-20h left and for k=71 and k=81 it is about an hour again.
> >  > > > > >
> >  > > > > > You should definitely use the polytope. It has no choke points, 
> > and routes are adaptive (i.e. messages between A and B will use several 
> > paths).
> >  > > > > >
> >  > > > > > > Does it only use Rank 0 at this step because this step can 
> > only be done by one core or is the graph that Rank 0 contains highly 
> > complex or something?
> >  > > > > >
> >  > > > > > At 1024 MPI ranks, rank 0 is one of the hubs in a de Bruijn 
> > graph.
> >  > > > > >
> >  > > > > > >
> >  > > > > > > If I want to continue running Ray, can I resume the process by 
> > running with the same parameters, but using only one core (-n 1)? Or
> >  > > > > > > should I use more cores?
> >  > > > > >
> >  > > > > > You have to re-launch Ray with the same command except the -o 
> > parameter.
> >  > > > > >
> >  > > > > > Example:
> >  > > > > >
> >  > > > > > mpiexec -n 1024 Ray \
> >  > > > > > -k \
> >  > > > > > 31 \
> >  > > > > > -i \
> >  > > > > > metassemble/assemblies/ray/pair.fastq \
> >  > > > > > -o \
> >  > > > > > metassemble/assemblies/ray/out_31 \
> >  > > > > > -read-write-checkpoints \
> >  > > > > > metassemble/assemblies/ray/out_31.cp \
> >  > > > > > -route-messages -connection-type polytope -routing-graph-degree 
> > 62
> >  > > > > >
> >  > > > > >
> >  > > > > > > When is the checkpointing done?
> >  > > > > >
> >  > > > > > At each step.
> >  > > > > >
> >  > > > > > To see your checkpoint files:
> >  > > > > >
> >  > > > > > ls metassemble/assemblies/ray/out_31.cp | less
> >  > > > > >
> >  > > > > >
> >  > > > > >
> >  > > > > > >
> >  > > > > > > It seems like I have to remove the output dir to resume from a 
> > checkpoint. Is that correct?
> >  > > > > >
> >  > > > > > No. It's not necessary. You can instead provide a new output 
> > directory.
> >  > > > > >
> >  > > > > > >
> >  > > > > > > Best regards,
> >  > > > > > > Ino de Bruijn
> >  > > > > >
> >  > > > > >
> >  > > > > >
> >  > > > > >
> >  > > > > > _______________________________________________
> >  > > > > > Denovoassembler-users mailing list
> >  > > > > > Denovoassembler-users@lists.sourceforge.net
> >  > > > > > 
> > https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
> >  > > >
> >  > > >
> >  >
> 