Hi Sébastien,

I've run two network-only tests on 1024 cores (64 nodes, 16 cores each, QDR
InfiniBand) using rc-4. The first test uses "complete" routing (i.e. routing
turned off), and the second uses the default debruijn routing. The reduction
in latency is significant:

# AverageForAllRanks: 1081.61
# StandardDeviation: 35.2018

versus, with debruijn and -route-messages:

# AverageForAllRanks: 280.065
# StandardDeviation: 47.1641
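
To put a number on the difference, here is a quick sketch that turns the two
averages above into a relative reduction and a speedup factor. The function
name is just something I made up for illustration; it is not part of Ray.

```python
def latency_change(baseline, routed):
    """Return (percent_reduction, speedup_factor) for two average latencies."""
    reduction = 100.0 * (baseline - routed) / baseline
    speedup = baseline / routed
    return reduction, speedup


# Averages reported above: complete routing first, then debruijn.
reduction, speedup = latency_change(1081.61, 280.065)
print(f"{reduction:.1f}% lower latency, {speedup:.2f}x speedup")
```

With these two averages that works out to roughly a 74% reduction. Given the
shared cluster, I would treat it as indicative rather than exact.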

I have the same two tests but with 2048 cores (128 nodes) queued up, and will 
report back once they've finished.

Note that this was running on a shared cluster that had other jobs, but these 
two tests were run in quick succession.

Thanks,

Ben

> Hello,
> 
> If you are running Ray (or plan to) on a large number of cores, you might
> be interested in a new feature available in the development tree of Ray.
> 
> This feature is a new option called -route-messages.
> 
> In Ray, any core can send a message directly to any other core,
> including itself.
> 
> For example, if you run Ray on 512 cores (let's say 64 computers with
> 8 cores each), then each core has 511 connections -- one with each other 
> core.
> 
> This means that each core has to check for incoming messages in a
> round-robin fashion for all 512 cores (this includes itself).
> 
> In this setting, the communication network is complete with
> 512 cores and 130816 connections (512 * 511 / 2).
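
The connection count quoted here is the number of distinct pairs of cores,
n * (n - 1) / 2. A minimal sketch to reproduce it for other core counts (the
function name is hypothetical, not something from Ray):

```python
def complete_graph_connections(cores):
    """Number of distinct core pairs (edges) in a complete graph."""
    return cores * (cores - 1) // 2


print(complete_graph_connections(512))   # 130816
print(complete_graph_connections(1024))  # 523776
```

The count grows quadratically, so doubling the cores roughly quadruples the
number of connections.
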
> 
> One way to avoid such a huge number of connections is to allow each core to
> communicate directly with only a few others. To do so, we can take the
> base-2 logarithm of the number of cores to get the average number of
> connections per core:
> 
> log2(512) = 9.
> 
> Since we want each core to have 9 connections on average, we need to
> randomly select 512*9/2 = 2304 connections from the 130816 possible ones
> in order to build the random graph.
> 
> Such a random graph has 512 cores, an average of 9 connections per core,
> and exactly 2304 edges (512*9/2).
> There are many such graphs, but it is easy to pick one.
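
As an illustration of "picking one", here is a minimal sketch that samples
n * log2(n) / 2 distinct edges uniformly from all possible pairs. This is only
my reading of the description, not Ray's actual code; the function name, the
seed and the rounding of log2 for core counts that are not powers of two are
all assumptions.

```python
import itertools
import math
import random


def random_routing_graph(cores, seed=0):
    """Pick cores*log2(cores)/2 distinct undirected edges uniformly at random."""
    degree = round(math.log2(cores))      # target average degree (9 for 512 cores)
    edge_count = cores * degree // 2      # 2304 edges for 512 cores
    all_pairs = list(itertools.combinations(range(cores), 2))
    rng = random.Random(seed)             # fixed seed so the result is reproducible
    return rng.sample(all_pairs, edge_count)


edges = random_routing_graph(512)

# Build adjacency sets to check the average number of connections per core.
neighbors = {core: set() for core in range(512)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

average_degree = sum(len(n) for n in neighbors.values()) / len(neighbors)
print(len(edges), average_degree)  # 2304 9.0
```

Note that uniform sampling like this does not guarantee a connected graph, so
a real implementation would presumably need to check for that.
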
> 
> In this case, each core has to check for incoming messages in a round-robin
> fashion for all the ~9+1 connections (+1 to include itself).
> 
> There is also less memory utilised for incoming buffers.
> 
> The length of the shortest route between any pair of cores in this random
> graph is, on average, about 3 connections.
> 
> This is because each core has 9 first neighbors, up to 81 second neighbors
> and up to 729 third neighbors; since 729 exceeds 512, roughly three hops
> suffice to reach every core (with some redundancy).
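
The neighbor-counting argument can be written as a back-of-envelope estimate:
find the smallest number of hops k for which degree**k reaches the number of
cores. This sketch ignores overlap between neighborhoods, so it is a rough
guide rather than a measured path length, and the function name is mine.

```python
def estimated_hops(cores, degree):
    """Smallest k such that degree**k >= cores."""
    hops, reachable = 0, 1
    while reachable < cores:
        reachable *= degree
        hops += 1
    return hops


print(estimated_hops(512, 9))  # 3, since 9**2 = 81 < 512 <= 9**3 = 729
```
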
> 
> But the main motivation is that the latency is reduced by about 60%.
> 
> 
> Latency without this random-graph routing:
> 
> 386 microseconds (standard deviation: 9)
> 
> 
> Latency with this random-graph routing:
> 
> 158 microseconds (standard deviation: 15)
> 
> 
> 
> If anyone would like to share their experience with Ray on a large number
> of cores, go ahead.
> 
> 
> More detailed post on the Open-MPI list (more technical):
> 
> http://www.open-mpi.org/community/lists/users/2011/11/17737.php
> 
> 
> Happy assembly.
> 
> 
> Sébastien Boisvert
> http://boisvert.info


_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
