Hi Christian,

Sorry for the very delayed answer.

At 03:16 27.03.2007, Christian Bell wrote:
> I can't type, 482 was indeed a typo.  But still, I wouldn't look at
> the absolute numbers "as is" since the single-node base case has
> different performance.  Since 1x2x1 is our only common base case and
> since Scali is faster at 4212 versus 4863, the IB interconnect you're
> testing should be achieving 416s instead of 550s to produce strong
> scaling in line with the 8x2x2 InfiniPath time to solution
> (at 482s).

Well, you do know Amdahl vs. Gustafson, right? The dataset is fixed, and the elapsed time includes initialization, writing of animation files, and more. Hence, slower per-node performance will _scale_ better: with a fixed serial part, the slower base case has a larger parallel fraction, so its speedup curve looks steeper.
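
To make the Amdahl point concrete, here is a minimal sketch in C (the 300 s serial component is an assumption for illustration; the 1x2x1 times are the ones quoted above). With a fixed serial part, the slower base case reports the higher parallel efficiency, although its absolute time to solution stays worse:

    /* Fixed-size (Amdahl) model with an assumed constant serial part
     * (initialization + animation-file writes). */
    #include <stdio.h>

    int main(void)
    {
        const double serial = 300.0;    /* assumed serial seconds      */
        const double fast1  = 4212.0;   /* Scali 1x2x1 base case       */
        const double slow1  = 4863.0;   /* slower 1x2x1 base case      */

        for (int n = 1; n <= 16; n *= 2) {
            double tf = serial + (fast1 - serial) / n;
            double ts = serial + (slow1 - serial) / n;
            printf("n=%2d  fast %6.0fs (eff %5.1f%%)  slow %6.0fs (eff %5.1f%%)\n",
                   n, tf, 100.0 * fast1 / (n * tf),
                   ts, 100.0 * slow1 / (n * ts));
        }
        return 0;
    }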

For this application field, crashworthiness testing, most users keep the number of cores constant throughout the duration of a project (12-18 months), due to numerical stability and the verification thereof. Hence, the interesting point is not how far and how fast you could run, but the cost of a system capable of running the application instances at 60-80% parallel efficiency.
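
To illustrate that sizing criterion with the same assumed model as above (the per-node price is a made-up placeholder, not a quote): one would pick the largest configuration that stays inside the 60-80% efficiency band and compare cost per unit of throughput there:

    /* Same assumed Amdahl model as above, now reporting throughput
     * (jobs/day) and a hypothetical cost per job/day.  node_price is
     * a placeholder for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const double t1 = 4212.0, serial = 300.0;  /* assumed, as above */
        const double node_price = 5000.0;          /* placeholder price */

        for (int n = 1; n <= 32; n *= 2) {
            double tn   = serial + (t1 - serial) / n;
            double eff  = 100.0 * t1 / (n * tn);
            double jobs = 86400.0 / tn;            /* jobs per day      */
            printf("n=%2d  eff %5.1f%%  %5.1f jobs/day  %6.0f $/(job/day)\n",
                   n, eff, jobs, n * node_price / jobs);
        }
        return 0;
    }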

As to RDMA- vs. message-passing-based interconnect semantics, the problem I am facing is that the RDMA interconnect I am using more or less collapses at 32 cores. Using alltoall with a packet size of 1 kB, it actually performs worse than GbE. Sigh! (And please do not turn this into vendor harassment, as I am pretty sure this has to do with the implementation and not the architecture.) So, what I have shown is that an RDMA interconnect performs faster than a message-passing interconnect which has roughly 3x lower latency and 20x (?) higher message rate, up to a scaling point where the RDMA _implementation_ collapses. And this _despite_ the fact that the RDMA-based MPI has to perform the MPI message matching.
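
For reference, the kind of measurement behind that observation could look like the sketch below: time MPI_Alltoall at 1 kB per peer (the iteration count and the missing warm-up phase are simplifications):

    /* Minimal alltoall timing sketch; compile with e.g. mpicc -O2.
     * 1 kB per peer is the size discussed above; no warm-up, no
     * per-size sweep, just the average over a fixed iteration count. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int msg = 1024, iters = 100;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *sbuf = calloc((size_t)size, msg);
        char *rbuf = calloc((size_t)size, msg);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Alltoall(sbuf, msg, MPI_BYTE, rbuf, msg, MPI_BYTE,
                         MPI_COMM_WORLD);
        double us = (MPI_Wtime() - t0) / iters * 1e6;

        if (rank == 0)
            printf("%d ranks, %d B per peer: %.1f us per alltoall\n",
                   size, msg, us);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }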


> With equal metrics/performance and phrased in this manner, it seems
> that RDMA still has to implement the semantics that message-passing
> already provides, which suggests in this case that the RDMA interface
> is at a loss.  Maybe I'm missing something to your question...

I doubt you're missing anything ;-) But let me stress that as the number of cores per node scales, an HCA with message-passing semantics, i.e. with the message matching done in the HCA, will have a constant message-matching rate. For an RDMA-based MPI, which uses the cores for message matching, the message-matching rate would be almost proportional to the number of cores...
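
Since the argument hinges on where the matching runs, here is a minimal sketch of what "MPI message matching" means (hypothetical types; a real MPI adds an unexpected-message queue, ordering rules for wildcards, and so on):

    /* Every incoming message header must be matched against the
     * posted-receive queue on (communicator, source, tag), honouring
     * the MPI wildcards.  Whether this loop runs on a host core
     * (RDMA-based MPI) or inside the HCA (message-passing HCA) is
     * exactly the trade-off discussed above. */
    #include <stddef.h>

    #define ANY_SRC (-1)              /* stands in for MPI_ANY_SOURCE */
    #define ANY_TAG (-2)              /* stands in for MPI_ANY_TAG    */

    struct posted_recv {
        int comm, src, tag;           /* match criteria               */
        void *buf;                    /* destination buffer           */
        struct posted_recv *next;     /* MPI mandates FIFO matching   */
    };

    /* Return the first posted receive matching an incoming header,
     * or NULL (the message then goes to the unexpected queue). */
    struct posted_recv *match(struct posted_recv *q,
                              int comm, int src, int tag)
    {
        for (; q; q = q->next)
            if (q->comm == comm &&
                (q->src == ANY_SRC || q->src == src) &&
                (q->tag == ANY_TAG || q->tag == tag))
                return q;
        return NULL;
    }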

Håkon



--
Håkon Bugge
CTO
dir. +47 22 62 89 72
mob. +47 92 48 45 14
fax. +47 22 62 89 51
[EMAIL PROTECTED]
Skype: hakon_bugge

Scali - http://www.scali.com
Scaling the Linux Datacenter


_______________________________________________
Beowulf mailing list, [EMAIL PROTECTED]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
