Hi again Christian,
At 16:59 26.03.2007, Christian Bell wrote:
Hi Håkon,
I'm unsure I would call a submission significant when it compares results between configurations that aren't compared at scale (apparently a large versus a small switch, and a much heavier shared-memory component at small process counts). For example, in your submitted configurations the interconnect communication (inter-node) is never exercised more than shared memory (intra-node), and when the interconnect does become dominant at 32 procs, that's when InfiniPath is faster.
Not sure how you count this. In my "world", all processes communicate with more remote processes than local ones in all cases except the single-node runs. I.e., in a two-node case with 2 or 4 processes per node, a process has 1 or 3 other local processes and 2 or 4 other remote processes. Excluding the single-node cases, we have six runs (2x2, 4x2, 8x2, 2x4, 4x4, 8x4), and RDMA is faster than message passing in 5 of them.
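To make the counting concrete, here is a trivial sketch (my own illustration, not anything from the benchmark itself), assuming the runs above are written as nodes x processes-per-node:

    #include <stdio.h>

    /* For `nodes` nodes with `ppn` processes each, every process has
       (ppn - 1) local peers (shared memory) and ppn * (nodes - 1)
       remote peers (interconnect). */
    static void peer_counts(int nodes, int ppn)
    {
        printf("%dx%d: %d local, %d remote peers per process\n",
               nodes, ppn, ppn - 1, ppn * (nodes - 1));
    }

    int main(void)
    {
        int runs[6][2] = { {2,2}, {4,2}, {8,2}, {2,4}, {4,4}, {8,4} };
        for (int i = 0; i < 6; i++)
            peer_counts(runs[i][0], runs[i][1]);
        return 0;
    }

Only the single-node runs have no remote peers, which is why I exclude them.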
As to the 32-core case, I run just as fast as InfiniPath there, but with a product that has not been released (yet); hence I haven't published it.
And based on this I did not call these significant findings, but merely an indication that RDMA is faster (up to 16 cores) than, or as fast as, message passing for _this_ application and dataset.
On the flip side, you're right that these
results show the importance of an MPI
implementation (at least for shared memory),
which also means your product is well positioned
for the next generation of node configurations
in this regard. However, because of the node
configurations and because this is really one
benchmark, I can't take these results as
indicative of general interconnect
performance. Oh, and because you're forcing me to compare results in this table, I now see what Patrick at Myricom was saying -- the largest config you show that stresses the interconnect (8x2x2) takes 596s of walltime on a similar Mellanox DDR setup and 452s on InfiniPath SDR (yes, the pipe is half as wide, but the performance is 25% better).
Just to avoid any confusion: the 596s number is _not_ with Scali MPI Connect (SMC), but with a competing MPI implementation. SMC achieves 551s using SDR. I must admit your InfiniPath number is new to me, as topcrunch reports 482s for this configuration with InfiniPath.
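Just to make the percentages explicit, here is a trivial sketch over the walltimes quoted in this thread (the pairings below are mine, and which numbers are fair to pair is of course part of the argument):

    #include <stdio.h>

    /* Walltimes (seconds) for the 8x2x2 run, as quoted in this thread. */
    static void compare(const char *slow, double ts, const char *fast, double tf)
    {
        printf("%s (%.0fs) vs %s (%.0fs): %.1f%% less walltime\n",
               fast, tf, slow, ts, 100.0 * (ts - tf) / ts);
    }

    int main(void)
    {
        compare("competing MPI, Mellanox DDR", 596.0,
                "InfiniPath SDR (Christian's figure)", 452.0);
        compare("Scali MPI Connect, SDR", 551.0,
                "InfiniPath (topcrunch figure)", 482.0);
        return 0;
    }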
We have performance engineers who gather this type of data and who've seen these trends on other benchmarks, and they'll be happy to correct any misconceptions, I'm certain.
Now I feel like I'm sticking my tongue out like
a shameless vendor and yet my original
discussion is not really about beating the
InfiniPath drum, which your reply insinuates.
Well, my intent was to draw the wulfers' attention to some published facts containing apples-to-apples comparisons, in an interesting discussion of RDMA vs. message passing. Given the significant (yes, I mean it) difference in latency and message rates, I was indeed surprised. My question still is: if there existed an RDMA API with characteristics similar to the best message-passing APIs, how would a good MPI implementation perform?
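To be clear about the distinction I mean -- the real question concerns the transport API underneath MPI (verbs-style RDMA versus a send/receive-style engine), but MPI's own interfaces show the same contrast in semantics -- here is a minimal sketch, purely my own illustration:

    #include <mpi.h>

    #define N 1024

    /* Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
    int main(int argc, char **argv)
    {
        int rank;
        double buf[N] = {0}, exposed[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two-sided message passing: both sides take part, and the
           library does the matching and any intermediate copying. */
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        /* One-sided, RDMA-style: the origin writes straight into memory
           the target has exposed; the target is involved only in the
           synchronization (the fences), not in the data movement. */
        MPI_Win win;
        MPI_Win_create(exposed, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);

        MPI_Finalize();
        return 0;
    }

The question, then, is how well a good MPI implementation could do if the interconnect's native primitive were of the RDMA kind but with the latency and message rate of the best message-passing interfaces.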
Håkon
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf