Hi!

On Tue, 2012-02-14 at 17:41 -0500, Ben Allen wrote:
> Hi Sébastien,
>
> I've run two network-only tests on 1024 cores (64 nodes, 16 cores ea.,
> QDR IB) using rc-4. One test is with "complete" routing (i.e. routing
> turned off),
Ok. I think that the Ray network test is useful for testing interconnects.

> and the second using the default debruijn. The speed-up in latency is
> significant:
>
> # AverageForAllRanks: 1081.61
> # StandardDeviation: 35.2018
>
> versus with debruijn -route-messages
>
> # AverageForAllRanks: 280.065
> # StandardDeviation: 47.1641

Impressive! There are at least two reasons:

1. Basically, each rank has fewer neighbours. Since message reception is
done in round-robin order over the neighbours only, it takes less time
to go around.

2. Obviously, it uses less memory, which is desirable.

Ray uses RayPlatform as its parallel compute engine, which is LGPL and
freely available. In RayPlatform, round-robin reception is used by
default.

If you are interested in RayPlatform (the framework and the engine), you
can read the README:

https://github.com/sebhtml/RayPlatform/blob/master/README

There is a simple example using it here:

https://github.com/sebhtml/RayPlatform-example

For instance, the network test you are using is simply a plugin
(interface CorePlugin) added to the core called ComputeCore, which is
the parallel compute engine. See

https://github.com/sebhtml/ray/blob/master/code/plugin_NetworkTest/NetworkTest.cpp#L329

Registering a plugin in RayPlatform is similar to loading a module in
Linux, in the sense that a CorePlugin requests resources from the
ComputeCore, and the core engine can expose things from one plugin to
other plugins. Modular stuff.

You can do better than 280 us. Is your hardware Mellanox, QLogic, or
something else? Are you using OFED?

When using

mpiexec -n 1024 Ray -test-network-only \
 -route-messages -connection-type debruijn

RayPlatform, the engine that Ray utilises, will try to find a degree
such that degree^n = 1024, starting with degree = 2. So in your case,

2*2*2*2*2*2*2*2*2*2 = 2^10 = 1024

This means that routes will have at most 10 edges (10 hops on the
hardware, if you will).

You can explore these things already, in fact.
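The degree search and routing just described can be sketched like this (a toy Python reimplementation with names of my own choosing, not RayPlatform's actual C++ code; a real router can also shortcut when source and destination words already overlap):

```python
# Toy sketch of the degree search described above: starting at
# degree = 2, look for a degree and an n such that degree**n equals
# the number of ranks; n is then the maximum route length in hops.
def find_degree_and_diameter(ranks):
    degree = 2
    while degree * degree <= ranks:
        power, n = 1, 0
        while power < ranks:
            power *= degree
            n += 1
        if power == ranks:
            return degree, n
        degree += 1
    return None  # no degree with n >= 2 fits this rank count

# A de Bruijn-style route shifts one digit of the destination in per
# hop. This naive version always uses the full n hops; stopping early
# on an overlap between source and destination would shorten routes.
def debruijn_route(source, destination, degree, diameter):
    size = degree ** diameter
    digits = []
    d = destination
    for _ in range(diameter):
        digits.append(d % degree)
        d //= degree
    route, current = [source], source
    for digit in reversed(digits):  # most significant digit first
        # drop the oldest digit, append one digit of the destination
        current = (current * degree + digit) % size
        route.append(current)
    return route

print(find_degree_and_diameter(1024))               # (2, 10)
print(len(debruijn_route(0, 1023, 2, 10)) - 1)      # 10 hops
```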
ls RayOutput/Routing
Connections.txt  RelayEvents.txt  Routes.txt  Summary.txt

You want to read Summary.txt first. These files are generated regardless
of the routing graph.

Increasing the degree will decrease the graph diameter, which is
desirable. So if you use

mpiexec -n 1024 Ray -test-network-only \
 -route-messages -connection-type debruijn \
 -routing-graph-degree 4

then you get 4^5 = 1024: each rank has 4 connections instead of 2, and
routes are at most 5 hops.

Presently, RayPlatform supports at most 4096 ranks when routing is
enabled, because routing information is stored in the MPI tag along with
the message tag, which can not exceed 255, I think. By patching MPI, the
limit could be removed, I guess.

Let's say you have 1024 ranks and a degree of 4. We can represent ranks
as DNA words over {A,T,C,G}. There are 4^5 = 1024 DNA words of length 5
in this example. There is an edge in the routing graph if two words
overlap on 4 symbols:

ATCGA -> TCGAC

The de Bruijn graph is OK, but it suffers from a non-uniform
distribution of relay events. A relay event occurs when a rank receives
a message for which it is not the final destination. The message is then
relayed by RayPlatform.

If you want even better latencies, you can try the Kautz graph, which is
a lot better than the de Bruijn graph (a Kautz graph is a subgraph of a
de Bruijn graph). In the Kautz graph, the same rule as for the de Bruijn
graph is used to establish connections, but any word with 2 consecutive
identical symbols is rejected. So, for instance, ATCGA is accepted, but
ATTGC is not.

Kautz graphs have the smallest diameter of any directed graph with a
given degree and number of vertices, and relay events are more uniform
than with the de Bruijn graph. For instance, the cluster system called
SC5832 from the now-defunct SiCortex had 972 nodes and was based on the
Kautz graph. In that Kautz graph, ranks are words of 6 symbols over an
alphabet of size 4.
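The Kautz acceptance rule is easy to check by brute force (a small sketch; `is_kautz_word` is a name I made up, not Ray code):

```python
from itertools import product

# A word is a Kautz word only if no two consecutive symbols are equal.
def is_kautz_word(word):
    return all(a != b for a, b in zip(word, word[1:]))

print(is_kautz_word("ATCGA"))  # True
print(is_kautz_word("ATTGC"))  # False: two consecutive T's

# Enumerate the SC5832-like ranks: words of 6 symbols over {A,T,C,G}.
words = [w for w in product("ATCG", repeat=6) if is_kautz_word(w)]
print(len(words))  # 972
```

The count follows the pattern 4 choices for the first symbol and 3 for each of the remaining five, which matches the arithmetic below.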
Any two consecutive symbols in any word must be different; otherwise the
word is not in the Kautz graph.

4*3*3*3*3*3 = 972

But with the Kautz graph, it is harder to guess a number of cores. One
thing you can do is to allocate, for instance, 4096 cores with the
scheduler and only use 4032 with mpiexec. 64*63 is 4032 (degree = 63,
diameter = 2, base = 64, vertices = 4032), but that is a lot of cores!

Instead of having 4031 direct connections (one per other rank), each
rank has exactly 63 connections, and any rank can reach any other rank
among the 4032 ranks in at most 2 hops.

With 1 hop: 63 ranks reachable
With 2 hops: 3969 ranks reachable

The "1 hop" set and the "2 hops" set are disjoint. 63 + 3969 = 4032
ranks.

Happy testing and routing!

Sébastien

> I have the same two tests but with 2048 cores (128 nodes) queued up,
> and will report back once they've finished.
>
> Note that this was running on a shared cluster that had other jobs,
> but these two tests were run in quick succession.
>
> Thanks,
>
> Ben
>
> > Hello,
> >
> > If you are running Ray (or plan to) on a large number of cores, you
> > might be interested in a new feature available in the development
> > tree of Ray.
> >
> > This feature is a new option called -route-messages.
> >
> > In Ray, any core can send a message directly to any other core,
> > including itself.
> >
> > For example, if you run Ray on 512 cores (let's say 64 computers
> > with 8 cores each), then each core has 511 connections -- one with
> > each other core.
> >
> > This means that each core has to check for incoming messages in a
> > round-robin fashion for all 512 cores (this includes itself).
> >
> > In this setting, the communication network is complete, with
> > 512 cores and 130816 connections (512 * 511 / 2).
> >
> > One way to avoid such a huge number of connections is to allow each
> > core to communicate directly with only a few others.
> > To do so, we can take the logarithm in base 2 of the number of
> > cores to get the average number of connections of a core:
> >
> > log2(512) = 9.
> >
> > Considering that we want any core to have 9 connections on average,
> > we need to select randomly 512*9 / 2 connections from the 130816
> > connections in order to build the random graph.
> >
> > Such a random graph has 512 cores, an average number of connections
> > of 9, and exactly 2304 edges (512*9/2). There are many such graphs,
> > but it is easy to pick one.
> >
> > In this case, each core has to check for incoming messages in a
> > round-robin fashion for all the ~9+1 connections (+1 to include
> > itself).
> >
> > There is also less memory used for incoming buffers.
> >
> > And the length of the shortest route between any pair of cores in
> > this random graph is, on average, 3 connections. This is because
> > there are 9 first neighbors, 81 second neighbors and 729 third
> > neighbors (with some redundancy).
> >
> > But the main motivation is that the latency is reduced by 60%.
> >
> > The latency without this routing with random graphs:
> >
> > 386 microseconds (standard deviation: 9)
> >
> > The latency with this routing with random graphs:
> >
> > 158 microseconds (standard deviation: 15)
> >
> > If anyone would like to share their experience with Ray on a large
> > number of cores, go ahead.
> >
> > More detailed post on the Open-MPI list (more technical):
> >
> > http://www.open-mpi.org/community/lists/users/2011/11/17737.php
> >
> > Happy assembly.
> >
> > Sébastien Boisvert
> > http://boisvert.info
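The arithmetic in the quoted announcement is easy to reproduce with a quick simulation (an illustration only, not the graph construction Ray actually ships; the seed is arbitrary):

```python
import random
from collections import deque

# Random routing graph from the announcement above: 512 cores,
# log2(512) = 9 connections per core on average, so 512*9/2 = 2304
# edges picked at random among the 130816 possible ones.
random.seed(42)  # arbitrary, for reproducibility
cores = 512
edges_wanted = cores * 9 // 2  # 2304

all_pairs = [(a, b) for a in range(cores) for b in range(a + 1, cores)]
chosen = random.sample(all_pairs, edges_wanted)

neighbours = [[] for _ in range(cores)]
for a, b in chosen:
    neighbours[a].append(b)
    neighbours[b].append(a)

def distances_from(start):
    """Breadth-first search giving hop counts from one core."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Average shortest route over all reachable ordered pairs.
total = count = 0
for source in range(cores):
    for destination, hops in distances_from(source).items():
        if destination != source:
            total += hops
            count += 1

average = total / count
print(len(all_pairs), len(chosen))  # 130816 2304
print(round(average, 2))            # close to 3, as stated above
```

With an average degree of 9, a core sees roughly 9 first neighbors, 81 second neighbors and 729 third neighbors, which is why the average route length lands near 3 hops.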
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users