On Wed, 2012-02-15 at 13:42 -0500, Allen, Benjamin S wrote:
> Hi Sébastien, 
> 
> 
> > You can do better than 280 us.
> > Your hardware, is it Mellanox, QLogic or another one ?
> > 
> 
> 
> In this case Mellanox ConnectX.
> 

We have the same on colosse at Université Laval.

> > 
> > Are you using OFED ?
> > 
> 
> 
> Yes.
> 
> 
> > So if you use 
> > 
> > mpiexec -n 1024 Ray -test-network-only \
> > -route-messages -connection-type debruijn \
> > -routing-graph-degree 4
> > 
> > 
> > Then you get 4^5 = 1024, each rank has 4 connections instead of 2
> > and
> > routes are at most 5 hops.
> 
> 
> Queued this test up.
> 
> 

Let me know if it improves things.



You can also try this for even better results:


mpiexec -n 992 Ray -test-network-only \
 -route-messages -connection-type kautz \
 -o Kautz-992-degree-31-diameter-2


Details:

992.routing:[GraphImplementationKautz::makeConnections] degree= 31
diameter= 2 base= 32 vertices= 992


You will get a lower latency and a very small standard deviation.

There will be 32 unused cores, but no unused nodes, because ranks are
mapped to cores in a round-robin fashion.

> > 
> > 
> > 
> > But with the Kautz graph, it is harder to guess a number of cores.
> > 
> > One thing you can do is to allocate, for instance, 4096 cores with
> > the
> > scheduler and only use 4032 with mpiexec.
> > 
> > 64*63 is 4032 (degree= 63, diameter= 2, base= 64, vertices= 4032),
> > but
> > that is a lot of cores !
> > 
> > Instead of having 4032 direct connections, each rank has exactly 63
> > connections and any rank can reach any other rank from the 4032
> > ranks in
> > at most 2 hops.
> > 
> > With 1 hop: 63 ranks reachable
> > With 2 hops: 3969 ranks reachable
> > 
> > The "1 hop" set and the "2 hops" set are disjoint.
> > 
> > 63 + 3969 = 4032 ranks.
> > 
> 
> 
> With the formula for kautz, n=(k+1)*k^(d-1), what values does one use
> for k and d?

N is the number of available cores.

n is the number of ranks you want to use in a Kautz tribe.

Obviously, n <= N

k is the degree of any vertex in the Kautz graph to be utilised.

The associated base is k+1.

d is the diameter of the graph.
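
To make the formula concrete, here is a small Python sketch (my own, not
part of Ray) that evaluates n = (k+1) * k^(d-1) and lists the diameter-2
Kautz graph sizes that fit under a given core count:

```python
def kautz_size(k, d):
    """Number of vertices in the Kautz graph of degree k and diameter d."""
    return (k + 1) * k ** (d - 1)

# Diameter-2 Kautz graphs that fit in N = 480 available cores.
N = 480
fits = [(k, kautz_size(k, 2)) for k in range(2, 32) if kautz_size(k, 2) <= N]
print(fits[-2:])  # -> [(20, 420), (21, 462)]
```

The last two entries are exactly the 420-vertex and 462-vertex graphs
mentioned below.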


>  I have a cluster that has 480 cores (24 / node) that I'd like to test
> this on.
> 

There is no Kautz graph with exactly 480 vertices.

However, there is a Kautz graph with 420 vertices and another with 462
vertices.

Details:

[GraphImplementationKautz::makeConnections] degree= 20 diameter= 2 base=
21 vertices= 420
[GraphImplementationKautz::makeConnections] degree= 21 diameter= 2 base=
22 vertices= 462


You can try:

available cores, N= 480
ranks, n= 462
routing graph degree, k= 21
routing graph diameter, d= 2

ask for 480 cores from the scheduler 
and run this:


mpiexec -n 462 Ray -test-network-only \
 -route-messages -connection-type kautz \
 -o Kautz-462-degree-21-diameter-2


RayPlatform should pick up the degree value of 21.

You have 20 nodes with 24 cores each.
There will be 18 unused cores.

Since most MPI flavours map ranks to cores in a round-robin fashion (I
know Open-MPI does this), exactly 18 nodes will have exactly 1 core that
does nothing.
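
As a sanity check, here is a quick Python sketch (mine, assuming the
round-robin placement described above, where rank i lands on node
i mod 20) that counts the idle cores per node:

```python
nodes, cores_per_node, ranks = 20, 24, 462

# Round-robin placement: rank i lands on node i % nodes.
used = [0] * nodes
for rank in range(ranks):
    used[rank % nodes] += 1

idle = [cores_per_node - u for u in used]
print(sum(idle))                       # -> 18 idle cores in total
print(sum(1 for i in idle if i == 1))  # -> 18 nodes with exactly 1 idle core
```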



You can try graph types with the stand-alone program graph_main in the
unit-tests directory.

To build it, you need to run 'bash graph_main.sh' inside unit-tests
first.

Then, you can use it like this:


./graph_main 462 kautz 1 1

where 462 is the number of vertices you want to test (without running
any MPI job) and where kautz is the desired type.

The two other arguments are the desired degree and the on-disk
verbosity. Providing 1 for the degree is safe because RayPlatform will
pick a valid degree on its own.

> 
> What does the "group" connection-type do?
> 

The group type is not really good. de Bruijn and Kautz are the best.

Basically, only one rank on any node is allowed to communicate with
ranks outside the node.

We implemented this to check whether having only 1 UNIX process that can
read and write on the InfiniBand Host Channel Adapter has any effect on
the latency. In the end, the tests based on this graph type alone were
not conclusive because communication events were not balanced
(producer-consumer problems).

The group type is thus really bad, as nothing is balanced.


The other type is random, this one was the first to be implemented.


The random type simply creates (degree * number of ranks / 2 ) edges.

It is also not very good because it does not scale in space. The reason
is that it needs to store a routing table, because the next rank to
relay to cannot be computed only from the final destination, the
original source and the current rank.

On the other hand, debruijn and kautz are really good because they don't
need to pre-compute or store a routing table, which saves a lot of
space.

Message routing without a routing table is better in every respect.
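
To illustrate why no table is needed, here is a Python sketch (my own,
not RayPlatform's actual code) of the classic de Bruijn shift routing:
each rank is treated as a d-digit base-k number, the next hop is
computed from the current rank and the destination alone, and the route
has at most d edges.

```python
def debruijn_route(source, dest, k, d):
    """Route on the de Bruijn graph B(k, d): ranks are numbers in [0, k**d).
    Each hop shifts one base-k digit of the destination into the current
    rank, so no routing table is needed and routes have at most d edges."""
    size = k ** d
    # Digits of dest, most significant first.
    digits = [(dest // k ** i) % k for i in reversed(range(d))]
    path = [source]
    current = source
    for digit in digits:
        if current == dest:
            break
        current = (current * k + digit) % size
        path.append(current)
    return path

print(debruijn_route(0, 5, 2, 3))  # -> [0, 1, 2, 5]
```

Every relay applies the same O(1) shift, so the whole route costs
O(diameter) with zero stored state.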

For instance, the de Bruijn graph of degree 64 and of diameter 4 has
16777216 vertices.

In an MPI collective with 16777216 ranks using de Bruijn routing with
degree 64, any rank can reach any other rank with a route of at most 4
edges. And there is no routing table. All routing operations are
O(diameter), which is quite constant in my book.

You can stack up a few more ranks using the corresponding Kautz graph
that has the same diameter (diameter: 4) and the same degree (degree:
64) -- this Kautz graph has 17039360 vertices. This is an increase of
262144 vertices.

Note that this Kautz graph is a subgraph of the de Bruijn graph of
degree 65 and of diameter 4, which has 17850625 vertices.
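
The three vertex counts above can be checked with one line each (a quick
Python sketch, using the formulas given earlier in this thread):

```python
# de Bruijn B(k, d) has k**d vertices; Kautz has (k + 1) * k**(d - 1).
debruijn_64_4 = 64 ** 4                # -> 16777216
kautz_64_4 = (64 + 1) * 64 ** (4 - 1)  # -> 17039360
debruijn_65_4 = 65 ** 4                # -> 17850625

print(kautz_64_4 - debruijn_64_4)      # -> 262144 extra ranks
```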

> 
> Thanks,
> 

Thank you for your keen interest in message routing !

> 
> Ben




_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
