Hi again Christian,

At 16:59 26.03.2007, Christian Bell wrote:

> Hi Håkon,

> I'm unsure I would call significant a submission that compares results between configurations not compared at scale (in appearance a large versus a small switch, and a much heavier shared-memory component at small process counts). For example, in your submitted configurations, the interconnect communication (inter-node) is never involved more than shared memory (intra-node), and when the interconnect does become dominant at 32 procs, that's when InfiniPath is faster.

Not sure how you count this. In my "world", all processes communicate with more remote processes than local ones in all cases except the single-node runs. That is, in a two-node case with 2 or 4 processes per node, a process has 1 or 3 other local processes and 2 or 4 remote processes. Excluding the single-node cases, we have six runs (2x2, 4x2, 8x2, 2x4, 4x4, 8x4), and RDMA is faster than message passing in 5 of them.
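
To spell out the counting (a trivial sketch only, reading the NxM labels as nodes x processes-per-node), the local/remote peer counts work out as follows:

/* Sketch only: assumes the NxM labels mean nodes x processes-per-node.
 * Each process then has (ppn - 1) local peers reachable via shared
 * memory and (nodes - 1) * ppn remote peers reachable via the
 * interconnect. */
#include <stdio.h>

int main(void)
{
    int configs[][2] = { {2, 2}, {4, 2}, {8, 2}, {2, 4}, {4, 4}, {8, 4} };
    int nconf = sizeof(configs) / sizeof(configs[0]);

    for (int i = 0; i < nconf; i++) {
        int nodes  = configs[i][0];
        int ppn    = configs[i][1];
        int local  = ppn - 1;            /* peers on the same node */
        int remote = (nodes - 1) * ppn;  /* peers on other nodes   */
        printf("%dx%d: %d local, %d remote peers per process\n",
               nodes, ppn, local, remote);
    }
    return 0;
}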

As to the 32-core case, I am running as fast as InfiniPath there, but with a product that has not been released (yet), hence I haven't published it.

And based on this I did not call it a significant finding, but merely an indication that RDMA is faster (up to 16 cores) than, or as fast as, message passing for _this_ application and dataset.

> On the flip side, you're right that these results show the importance of an MPI implementation (at least for shared memory), which also means your product is well positioned for the next generation of node configurations in this regard. However, because of the node configurations and because this is really one benchmark, I can't take these results as indicative of general interconnect performance. Oh, and because you're forcing me to compare results on this table, I now see what Patrick at Myricom was saying -- the largest config you show that stresses the interconnect (8x2x2) takes 596s walltime on a similar Mellanox DDR and 452s walltime on InfiniPath SDR (yes, the pipe is "100%" smaller but the performance is 25% better).

Just to avoid any confusion, the 596s number is _not_ with Scali MPI Connect (SMC), but with a competing MPI implementation. SMC achieves 551s using SDR. I must admit your InfiniPath number is new to me, as topcrunch reports 482s for this configuration with InfiniPath.
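
To make the arithmetic explicit, here is a back-of-the-envelope sketch using only the walltimes quoted in this thread (the labels are just my shorthand for the runs mentioned above):

/* Back-of-the-envelope comparison of the 8x2x2 walltimes quoted in this
 * thread, relative to the 596s run.  "Better" is given both as walltime
 * reduction and as rate (1/walltime) improvement, since the two differ. */
#include <stdio.h>

int main(void)
{
    struct { const char *label; double secs; } runs[] = {
        { "competing MPI, Mellanox DDR", 596.0 },
        { "SMC, SDR",                    551.0 },
        { "InfiniPath (topcrunch)",      482.0 },
        { "InfiniPath (quoted above)",   452.0 },
    };
    double ref = runs[0].secs;

    for (int i = 1; i < 4; i++) {
        double wall = 100.0 * (ref - runs[i].secs) / ref;       /* % less walltime */
        double rate = 100.0 * (ref / runs[i].secs - 1.0);       /* % higher rate   */
        printf("%s: %.0fs, %.1f%% less walltime, %.1f%% higher rate vs 596s\n",
               runs[i].label, runs[i].secs, wall, rate);
    }
    return 0;
}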

> We have performance engineers who gather this type of data and who've seen these trends on other benchmarks, and they'll be happy to right any misconceptions, I'm certain.

> Now I feel like I'm sticking my tongue out like a shameless vendor and yet my original discussion is not really about beating the InfiniPath drum, which your reply insinuates.

Well, my intent was to draw the wulfers' attention to some published facts containing apples-to-apples comparisons, in an interesting discussion of RDMA vs. message passing. Given the significant (yes, I mean it) differences in latency and message rates, I was indeed surprised. My question still is: if there existed an RDMA API with characteristics similar to the best message-passing APIs, how would a good MPI implementation perform?
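
To be concrete about the two API styles I'm contrasting, here is a minimal MPI sketch (not code from any of the runs above; buffer names and sizes are arbitrary):

/* Minimal MPI sketch contrasting the two communication styles under
 * discussion: a two-sided send/recv pair, where both sides post an
 * operation, versus a one-sided (RDMA-style) MPI_Put into an exposed
 * memory window, where the target posts no matching receive.
 * Buffer names and sizes are arbitrary; run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, size;
    double buf[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++)
        buf[i] = rank;

    if (size >= 2) {
        /* Two-sided message passing: explicit send and matching receive. */
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    /* One-sided, RDMA-style: every rank exposes buf as a window
     * (collective call), then rank 0 writes directly into rank 1's
     * window with MPI_Put -- no receive is posted on rank 1. */
    MPI_Win_create(buf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0 && size >= 2)
        MPI_Put(buf, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    if (rank == 1)
        printf("rank 1: buf[0] = %.0f after the one-sided put\n", buf[0]);

    MPI_Finalize();
    return 0;
}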


Håkon






