Dear Sébastien Boisvert,
I tried a run without specifying the route-messages parameter but the problem
remains the same.
At the end only Rank 0 is outputting data:
Speed RAY_SLAVE_MODE_ADD_EDGES 4066 units/second
Estimated remaining time for this step: 12 hours, 36 minutes, 45 seconds
Rank 0 is adding edges [64950001/249559758]
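As a rough check, the remaining-time estimate in the log above can be reproduced with integer arithmetic. This is only a sketch: it assumes the estimate is simply (total - done) / speed, which is a guess and not taken from Ray's source, and `eta_seconds` and `fmt_hms` are hypothetical helper names.

```shell
# Assumption: remaining time = (total units - units done) / units per second.
# eta_seconds and fmt_hms are hypothetical helpers, not part of Ray.
eta_seconds() {
    total=$1; done_units=$2; rate=$3
    echo $(( (total - done_units) / rate ))
}

# Format a number of seconds the way the log line does.
fmt_hms() {
    s=$1
    printf '%d hours, %d minutes, %d seconds\n' \
        $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

# Numbers copied from the log lines above.
fmt_hms "$(eta_seconds 249559758 64950001 4066)"
```

This prints 12 hours, 36 minutes, 43 seconds, within a couple of seconds of the logged estimate; the small difference is presumably because the printed speed is rounded.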
The command I used was:
mpiexec -n 1024 Ray \
-k \
41 \
-i \
metassemble/assemblies/ray/pair.fastq \
-o \
metassemble/assemblies/ray/out_41 \
-read-write-checkpoints \
metassemble/assemblies/ray/out_41.cp
on a Cray XE6 system:
http://www.pdc.kth.se/resources/computers/lindgren/hardware.
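For the other k values, the same command can be generated per k so that each run gets its own output and checkpoint directory. The following is a dry-run sketch that only prints the commands; `ray_cmd` and `base` are hypothetical names, and the flags and paths are copied from the command above.

```shell
# Dry run: print one Ray command per k value instead of launching it.
# ray_cmd and base are hypothetical names used for illustration only.
base=metassemble/assemblies/ray

ray_cmd() {
    k=$1
    echo "mpiexec -n 1024 Ray -k $k -i $base/pair.fastq -o $base/out_$k -read-write-checkpoints $base/out_$k.cp"
}

# k values 41 to 81 in steps of 10.
for k in 41 51 61 71 81; do
    ray_cmd "$k"
done
```

Each printed line can be pasted into a job script once it has been checked.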
Best regards,
Ino
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: RE: [Denovoassembler-users] Long execution time, seems to be stuck at
Rank 0
Date: Tue, 26 Mar 2013 00:26:05 +0100
Ah, excellent! My knowledge of this kind of system is clearly lacking. Thanks
a lot!
Best regards,
Ino
> Date: Mon, 25 Mar 2013 17:44:12 -0400
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Subject: Re: [Denovoassembler-users] Long execution time, seems to be stuck
> at Rank 0
>
> On 25/03/13 12:46 PM, Ino de Bruijn wrote:
> > Thanks a lot for the prompt reply!
> >
> > > What is your interconnect ?
> > >
> > > Do you have Infiniband ?
> >
> > It is a Cray XE6 system that uses the Cray Gemini interconnect technology.
> > These are the full specs:
> > http://www.pdc.kth.se/resources/computers/lindgren/hardware
>
> In that case, you don't need to route your messages !
>
> The Cray XE6 has the best interconnect out there ! It's a 3D torus.
>
> >
> > Is the polytope connection type good for this type of interconnect as well?
> >
>
> You should remove the -route-messages option altogether.
>
>
> -route-messages is useful when using buggy Infiniband fabrics or TCP
> networks. If you use something like:
>
> * Cray XE6
> * IBM Blue Gene/Q
> * Intel PSM (QLogic Infiniband)
> * IBM iDataPlex
>
>
> you don't need this option because these systems are really good and provide
> low-latency any-to-any message passing.
>
>
> > Best regards,
> > Ino
> >
> > > Date: Mon, 25 Mar 2013 11:24:21 -0400
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: Re: [Denovoassembler-users] Long execution time, seems to be
> > > stuck at Rank 0
> > >
> > > On 25/03/13 05:32 AM, Ino de Bruijn wrote:
> > > > Dear Sébastien Boisvert,
> > > >
> > > > I am trying to assemble a paired-end Illumina Hiseq library of about 1
> > > > billion reads. I ran Ray with:
> > > >
> > > > mpiexec -n 1024 Ray \
> > > > -k \
> > > > 31 \
> > > > -i \
> > > > metassemble/assemblies/ray/pair.fastq \
> > > > -o \
> > > > metassemble/assemblies/ray/out_31 \
> > > > -read-write-checkpoints \
> > > > metassemble/assemblies/ray/out_31.cp \
> > > > -route-messages
> > >
> > > Without other arguments, -route-messages will use a de Bruijn graph for
> > > routing, which is not really good.
> > >
> > > What is your interconnect ?
> > >
> > > Do you have Infiniband ?
> > >
> > >
> > > Use this instead (the polytope is the best routing engine in
> > > RayPlatform):
> > >
> > > -route-messages -connection-type polytope -routing-graph-degree 62
> > >
> > > (from https://github.com/sebhtml/ray/blob/master/Documentation/Routing.txt )
> > >
> > > >
> > > > For k=31 the assembly succeeds in ~9h on 1,024 cores. If I try higher
> > > > values of k (i.e. {41..81..10}), the run
> > > > is exited by the scheduler after a day (max run time is one day). If I
> > > > look at the log of the stdout it seems
> > > > like only Rank 0 is doing something at the end. Here are a couple
> > > > of lines from the output:
> > > >
> > >
> > > Well, you are using a de Bruijn graph for routing your messages. A de
> > > Bruijn graph is theoretically cool for routing messages,
> > > but in practice it's very bad because it's not adaptive and it's just
> > > a pit containing so many choke points.
> > >
> > > From the manual https://github.com/sebhtml/ray/blob/master/MANUAL_PAGE.txt :
> > >
> > > -connection-type type
> > > Sets the connection type for routes.
> > > Accepted values are debruijn, hypercube, polytope, group, random, kautz
> > > and complete. Default is debruijn.
> > >
> > >
> > >
> > > > Rank 0 is counting k-mers in sequence reads [51200001/249559758]
> > > > Speed RAY_SLAVE_MODE_ADD_VERTICES 4909 units/second
> > > > Estimated remaining time for this step: 11 hours, 13 minutes, 27 seconds
> > > >
> > > > This keeps going while only Rank 0 is outputting. The final message
> > > > says there are 30 minutes left for k=41. For 51 and 61 it is around 10-20h
> > > > left and for k=71 and k=81 it is about an hour again.
> > >
> > > You should definitely use the polytope. It has no choke points, and
> > > routes are adaptive (i.e. messages between A and B will use several
> > > paths).
> > >
> > > > Does it only use Rank 0 at this step because this step can only be
> > > > done by one core or is the graph that Rank 0 contains highly complex or
> > > > something?
> > >
> > > At 1024 MPI ranks, rank 0 is one of the hubs in a de Bruijn graph.
> > >
> > > >
> > > > If I want to continue running Ray, can I resume the process by running
> > > > the same parameters, but using only one core (-n 1)? Or
> > > > should I use more cores?
> > >
> > > You have to re-launch Ray with the same command except the -o parameter.
> > >
> > > Example:
> > >
> > > mpiexec -n 1024 Ray \
> > > -k \
> > > 31 \
> > > -i \
> > > metassemble/assemblies/ray/pair.fastq \
> > > -o \
> > > metassemble/assemblies/ray/out_31 \
> > > -read-write-checkpoints \
> > > metassemble/assemblies/ray/out_31.cp \
> > > -route-messages -connection-type polytope -routing-graph-degree 62
> > >
> > >
> > > > When is the checkpointing done?
> > >
> > > At each step.
> > >
> > > To see your checkpoint files:
> > >
> > > ls metassemble/assemblies/ray/out_31.cp | less
> > >
> > >
> > >
> > > >
> > > > It seems like I have to remove the output dir to resume from a
> > > > checkpoint. Is that correct?
> > >
> > > No. It's not necessary. You can instead provide a new output directory.
> > >
> > > >
> > > > Best regards,
> > > > Ino de Bruijn
> > >
> > >
> > >
> > >
> > >
> > ------------------------------------------------------------------------------
> > > Everyone hates slow websites. So do we.
> > > Make your web apps faster with AppDynamics
> > > Download AppDynamics Lite for free today:
> > > http://p.sf.net/sfu/appdyn_d2d_mar
> > > _______________________________________________
> > > Denovoassembler-users mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
>
>