Hi!

On Tue, 2012-02-14 at 17:41 -0500, Ben Allen wrote:
> Hi Sébastien,
>
> I've run two network-only tests on 1024 cores (64 nodes, 16 cores ea.,
> QDR IB) using rc-4. One test is with "complete" routing (i.e. routing
> turned off),
Ok. I think that the Ray network test is useful for testing interconnects.

> and the second using the default debruijn. The speed-up in latency is
> significant:
>
> # AverageForAllRanks: 1081.61
> # StandardDeviation: 35.2018
>
> versus with debruijn -route-messages
>
> # AverageForAllRanks: 280.065
> # StandardDeviation: 47.1641

Impressive! There are at least two reasons:

1. Basically, each rank has fewer neighbours. Since message reception is
done in round-robin order over the neighbours only, it takes less time
to go around.

2. Obviously, it uses less memory, which is desirable.

Ray uses RayPlatform as its parallel compute engine, which is LGPL and
freely available. In RayPlatform, round-robin reception is used by
default.

If you are interested in RayPlatform (the framework and the engine), you
can read the README:

https://github.com/sebhtml/RayPlatform/blob/master/README

There is a simple example using it here:

https://github.com/sebhtml/RayPlatform-example

For instance, the network test you are using is simply a plugin
(interface CorePlugin) added to the core called ComputeCore, which is
the parallel compute engine. See

https://github.com/sebhtml/ray/blob/master/code/plugin_NetworkTest/NetworkTest.cpp#L329

Registering a plugin in RayPlatform is similar to loading a module in
Linux, in the sense that a CorePlugin requests resources from the
ComputeCore, and the core engine can expose things from one plugin to
other plugins. Modular stuff.

You can do better than 280 us. Is your hardware Mellanox, QLogic, or
something else? Are you using OFED?

When using

mpiexec -n 1024 Ray -test-network-only \
 -route-messages -connection-type debruijn

RayPlatform, the engine that Ray utilises, will try to find a degree
such that degree^n = 1024, starting with degree = 2. So in your case,

2*2*2*2*2*2*2*2*2*2 = 2^10 = 1024

This means that routes will have at most 10 edges (10 hops on the
hardware, if you will).

You can explore these things already, in fact.
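The degree search and routing just described can be sketched like this (a toy Python reimplementation with names of my own choosing, not RayPlatform's actual C++ code; a real router can also shortcut when source and destination words already overlap):

```python
# Toy sketch of the degree search described above: starting at
# degree = 2, look for a degree and an n such that degree**n equals
# the number of ranks; n is then the maximum route length in hops.
def find_degree_and_diameter(ranks):
    degree = 2
    while degree * degree <= ranks:
        power, n = 1, 0
        while power < ranks:
            power *= degree
            n += 1
        if power == ranks:
            return degree, n
        degree += 1
    return None  # no degree with n >= 2 fits this rank count

# A de Bruijn-style route shifts one digit of the destination in per
# hop. This naive version always uses the full n hops; stopping early
# on an overlap between source and destination would shorten routes.
def debruijn_route(source, destination, degree, diameter):
    size = degree ** diameter
    digits = []
    d = destination
    for _ in range(diameter):
        digits.append(d % degree)
        d //= degree
    route, current = [source], source
    for digit in reversed(digits):  # most significant digit first
        # drop the oldest digit, append one digit of the destination
        current = (current * degree + digit) % size
        route.append(current)
    return route

print(find_degree_and_diameter(1024))               # (2, 10)
print(len(debruijn_route(0, 1023, 2, 10)) - 1)      # 10 hops
```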
ls RayOutput/Routing
Connections.txt  RelayEvents.txt  Routes.txt  Summary.txt

You want to read Summary.txt first. These files are generated regardless
of the routing graph.

Increasing the degree will decrease the graph diameter, which is
desirable. So if you use

mpiexec -n 1024 Ray -test-network-only \
 -route-messages -connection-type debruijn \
 -routing-graph-degree 4

then you get 4^5 = 1024: each rank has 4 connections instead of 2, and
routes are at most 5 hops.

Presently, RayPlatform supports at most 4096 ranks when routing is
enabled, because routing information is stored in the MPI tag along with
the message tag, which can not exceed 255, I think. By patching MPI, the
limit could be removed, I guess.

Let's say you have 1024 ranks and a degree of 4. We can represent ranks
as DNA words over {A,T,C,G}. There are 4^5 = 1024 DNA words of length 5
in this example. There is an edge in the routing graph if two words
overlap on 4 symbols:

ATCGA -> TCGAC

The de Bruijn graph is OK, but it suffers from a non-uniform
distribution of relay events. A relay event occurs when a rank receives
a message for which it is not the final destination. The message is then
relayed by RayPlatform.

If you want even better latencies, you can try the Kautz graph, which is
a lot better than the de Bruijn graph (a Kautz graph is a subgraph of a
de Bruijn graph). In the Kautz graph, the same rule as for the de Bruijn
graph is used to establish connections, but any word with 2 consecutive
identical symbols is rejected. So, for instance, ATCGA is accepted, but
ATTGC is not.

Kautz graphs have the smallest diameter of any directed graph with a
given degree and number of vertices, and relay events are more uniform
than with the de Bruijn graph. For instance, the cluster system called
SC5832 from the now-defunct SiCortex had 972 nodes and was based on the
Kautz graph. In that Kautz graph, ranks are words of 6 symbols over an
alphabet of size 4.
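The Kautz acceptance rule is easy to check by brute force (a small sketch; `is_kautz_word` is a name I made up, not Ray code):

```python
from itertools import product

# A word is a Kautz word only if no two consecutive symbols are equal.
def is_kautz_word(word):
    return all(a != b for a, b in zip(word, word[1:]))

print(is_kautz_word("ATCGA"))  # True
print(is_kautz_word("ATTGC"))  # False: two consecutive T's

# Enumerate the SC5832-like ranks: words of 6 symbols over {A,T,C,G}.
words = [w for w in product("ATCG", repeat=6) if is_kautz_word(w)]
print(len(words))  # 972
```

The count follows the pattern 4 choices for the first symbol and 3 for each of the remaining five, which matches the arithmetic below.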
Any two consecutive symbols in any word must be different; otherwise the
word is not in the Kautz graph.

4*3*3*3*3*3 = 972

But with the Kautz graph, it is harder to guess a number of cores. One
thing you can do is to allocate, for instance, 4096 cores with the
scheduler and only use 4032 with mpiexec. 64*63 is 4032 (degree = 63,
diameter = 2, base = 64, vertices = 4032), but that is a lot of cores!

Instead of having 4031 direct connections (one per other rank), each
rank has exactly 63 connections, and any rank can reach any other rank
among the 4032 ranks in at most 2 hops.

With 1 hop: 63 ranks reachable
With 2 hops: 3969 ranks reachable

The "1 hop" set and the "2 hops" set are disjoint. 63 + 3969 = 4032
ranks.

Happy testing and routing!

Sébastien

> I have the same two tests but with 2048 cores (128 nodes) queued up,
> and will report back once they've finished.
>
> Note that this was running on a shared cluster that had other jobs,
> but these two tests were run in quick succession.
>
> Thanks,
>
> Ben
>
> > Hello,
> >
> > If you are running Ray (or plan to) on a large number of cores, you
> > might be interested in a new feature available in the development
> > tree of Ray.
> >
> > This feature is a new option called -route-messages.
> >
> > In Ray, any core can send a message directly to any other core,
> > including itself.
> >
> > For example, if you run Ray on 512 cores (let's say 64 computers
> > with 8 cores each), then each core has 511 connections -- one with
> > each other core.
> >
> > This means that each core has to check for incoming messages in a
> > round-robin fashion for all 512 cores (this includes itself).
> >
> > In this setting, the communication network is complete, with
> > 512 cores and 130816 connections (512 * 511 / 2).
> >
> > One way to avoid such a huge number of connections is to allow each
> > core to communicate directly with only a few others.
> > To do so, we can take the logarithm in base 2 of the number of
> > cores to get the average number of connections of a core:
> >
> > log2(512) = 9.
> >
> > Considering that we want any core to have 9 connections on average,
> > we need to select randomly 512*9 / 2 connections from the 130816
> > connections in order to build the random graph.
> >
> > Such a random graph has 512 cores, an average number of connections
> > of 9, and exactly 2304 edges (512*9/2). There are many such graphs,
> > but it is easy to pick one.
> >
> > In this case, each core has to check for incoming messages in a
> > round-robin fashion for all the ~9+1 connections (+1 to include
> > itself).
> >
> > There is also less memory used for incoming buffers.
> >
> > And the length of the shortest route between any pair of cores in
> > this random graph is, on average, 3 connections. This is because
> > there are 9 first neighbors, 81 second neighbors and 729 third
> > neighbors (with some redundancy).
> >
> > But the main motivation is that the latency is reduced by 60%.
> >
> > The latency without this routing with random graphs:
> >
> > 386 microseconds (standard deviation: 9)
> >
> > The latency with this routing with random graphs:
> >
> > 158 microseconds (standard deviation: 15)
> >
> > If anyone would like to share their experience with Ray on a large
> > number of cores, go ahead.
> >
> > More detailed post on the Open-MPI list (more technical):
> >
> > http://www.open-mpi.org/community/lists/users/2011/11/17737.php
> >
> > Happy assembly.
> >
> > Sébastien Boisvert
> > http://boisvert.info
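The arithmetic in the quoted announcement is easy to reproduce with a quick simulation (an illustration only, not the graph construction Ray actually ships; the seed is arbitrary):

```python
import random
from collections import deque

# Random routing graph from the announcement above: 512 cores,
# log2(512) = 9 connections per core on average, so 512*9/2 = 2304
# edges picked at random among the 130816 possible ones.
random.seed(42)  # arbitrary, for reproducibility
cores = 512
edges_wanted = cores * 9 // 2  # 2304

all_pairs = [(a, b) for a in range(cores) for b in range(a + 1, cores)]
chosen = random.sample(all_pairs, edges_wanted)

neighbours = [[] for _ in range(cores)]
for a, b in chosen:
    neighbours[a].append(b)
    neighbours[b].append(a)

def distances_from(start):
    """Breadth-first search giving hop counts from one core."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Average shortest route over all reachable ordered pairs.
total = count = 0
for source in range(cores):
    for destination, hops in distances_from(source).items():
        if destination != source:
            total += hops
            count += 1

average = total / count
print(len(all_pairs), len(chosen))  # 130816 2304
print(round(average, 2))            # close to 3, as stated above
```

With an average degree of 9, a core sees roughly 9 first neighbors, 81 second neighbors and 729 third neighbors, which is why the average route length lands near 3 hops.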
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users