Hi Gilad,

Thank you for the personal attack, which apparently came without your even reading the email I sent. Brian asked why the publicly available, independently run MPI2007 results from HP were worse on a particular benchmark than the Cambridge cluster MPI2007 results. I talked about three contributing factors to that. If you have other reasons you want to put forward, please do so based on data rather than engaging in a blatant ad hominem attack.
If you want to engage in a marketing war, there are venues in which to do it, but I think on the Beowulf mailing list data and coherent thought are probably more appropriate.

-Kevin

On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:
> Dear Kevin,
>
> You continue to set world records in providing misleading information.
> You had previously compared Mellanox-based products on dual single-core
> machines to the "InfiniPath" adapter on dual dual-core machines and
> claimed that with InfiniPath there are more Gflops.... This latest
> release follows the same lines...
>
> Unlike QLogic InfiniPath adapters, Mellanox provides different InfiniBand
> HCA silicon and adapters. There are 4 different silicon chips, each with
> a different size, different power, different price and different
> performance. There is the PCI-X device (InfiniHost), the single-port
> device that was designed for best price/performance (InfiniHost III Lx),
> the dual-port device that was designed for best performance (InfiniHost
> III Ex) and the new ConnectX device that was designed to extend the
> performance capabilities of the dual-port device. Each device provides
> different price and performance points (did I say different?).
>
> The SPEC results that you are using for Mellanox are of the single-port
> device. And even that device (whose list price is probably half that of
> your InfiniPath) had better results with 8 server nodes than yours....
> Your comparison of InfiniPath to the Mellanox single-port device should
> have been on price/performance and not on performance. Now, if you want
> to really compare performance to performance, why don't you use the
> dual-port device, or even better, ConnectX? Well... I will do it for
> you. Every time I have compared my performance adapters to yours, your
> adapters did not even come close...
>
>
> Gilad.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> On Behalf Of Kevin Ball
> Sent: Thursday, July 19, 2007 11:52 AM
> To: Brian Dobbins
> Cc: [email protected]
> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
>
> Hi Brian,
>
> The benchmark 121.pop2 is based on a code that was already important
> to QLogic customers before the SPEC MPI2007 suite was released (POP,
> the Parallel Ocean Program), and we have done a fair amount of analysis
> trying to understand its performance characteristics. There are three
> things that stand out in performance analysis of pop2.
>
> The first point is that pop2 is a very demanding code for the compiler.
> There has been a fair amount of work on pop2 by the PathScale compiler
> team, and the fact that the Cambridge submission used the PathScale
> compiler while the HP submission used the Intel compiler accounts for
> some (the serial portion) of the advantage at small core counts, though
> scalability should not be affected by this.
>
> The second point is that pop2 is fairly demanding of I/O. Another
> example to look at here is the comparison of the AMD Emerald cluster
> results to the Cambridge results; the Emerald cluster is using NFS over
> GigE from a single server/disk, while Cambridge has a much more
> optimized I/O subsystem. While on some results Emerald scales better,
> for pop2 it scales only from 3.71 to 15.0 (4.04X), while Cambridge
> scales from 4.29 to 21.0 (4.90X). The HP system appears to be using NFS
> over DDR IB from a single server with a RAID; thus it should fall
> somewhere between Emerald and Cambridge in this regard.
>
> The first two points account for some of the difference, but by no
> means all. The final one is probably the most crucial. The code pop2
> uses a communication pattern consisting of many small/medium-sized
> (between 512 bytes and 4 KB) point-to-point messages punctuated by
> periodic tiny (8-byte) allreduces. The QLogic InfiniPath architecture
> performs far better in this regime than the Mellanox InfiniHost
> architecture.
>
> This is consistent with what we have seen in other application
> benchmarking; even SDR InfiniBand based on the QLogic InfiniPath
> architecture performs in general as well as DDR InfiniBand based on
> the Mellanox InfiniHost architecture, and in some cases better.
>
> Full disclosure: I work for QLogic on the InfiniPath product line.
>
> -Kevin
>
>
> On Wed, 2007-07-18 at 18:50, Brian Dobbins wrote:
> > Hi guys,
> >
> > Greg, thanks for the link! It will no doubt take me a little while
> > to parse all the MPI2007 info (even though there are only a few
> > submitted results at the moment!), but one of the first things I
> > noticed was that the performance of pop2 on the HP blade system was
> > beyond atrocious... any thoughts on why this is the case? I can't
> > see any logical reason for the scaling they have, which (being the
> > first thing I noticed) makes me somewhat hesitant to put much stock
> > in the results at the moment. Perhaps this system is just a
> > statistical blip on the radar which will fade into noise when
> > additional results are posted, but until that time, it'd be nice to
> > know why the results are the way they are.
> >
> > To spell it out a bit, the reference platform is at 1 (ok, 0.994) on
> > 16 cores, but then the HP blade system at 16 cores is at 1.94. Not
> > bad there. However, moving up we have:
> > 32 cores - 2.36
> > 64 cores - 2.02
> > 128 cores - 2.14
> > 256 cores - 3.62
> >
> > So not only does it hover at 2.x for a while, but then going from
> > 128 -> 256 it gets a decent relative improvement. Weird.
> > On the other hand, the Cambridge system (with the same processors
> > and a roughly similar interconnect, it seems) has the following
> > scaling from 32 -> 256 cores:
> >
> > 32 cores - 4.29
> > 64 cores - 7.37
> > 128 cores - 11.5
> > 256 cores - 15.4
> >
> > ... So, I'm mildly confused about the first results. Granted,
> > different compilers are being used, and presumably there are other
> > differences, too, but I can't see how -any- of them could result in
> > the scores the HP system got. Any thoughts? Anyone from HP (or
> > QLogic) care to comment? I'm not terribly knowledgeable about the
> > MPI2007 suite yet, unfortunately, so maybe I'm just overlooking
> > something.
> >
> > Cheers,
> > - Brian

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
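
The scaling factors Kevin quotes (4.04X for Emerald, 4.90X for Cambridge) are simply the ratio of each cluster's pop2 score at the largest published core count to its score at the smallest. A minimal C sketch of that arithmetic, using only the scores quoted in the thread:

/* Scaling factor = largest-count pop2 score / smallest-count pop2 score,
 * using the SPEC MPI2007 numbers quoted in the mail above. */
#include <stdio.h>

int main(void)
{
    double emerald_low   = 3.71, emerald_high   = 15.0;  /* Emerald, NFS over GigE */
    double cambridge_low = 4.29, cambridge_high = 21.0;  /* Cambridge cluster      */

    printf("Emerald scaling:   %.2fX\n", emerald_high / emerald_low);     /* ~4.04X */
    printf("Cambridge scaling: %.2fX\n", cambridge_high / cambridge_low); /* ~4.90X */
    return 0;
}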
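
To make the pop2 communication pattern Kevin describes more concrete, here is a minimal MPI sketch in C. It is an illustration only, not code from POP or the SPEC MPI2007 suite; the message size, step count, neighbour scheme, and allreduce period are assumed values chosen to match the description (512-byte-to-4-KB point-to-point messages punctuated by tiny 8-byte allreduces).

/* Hypothetical sketch of the communication pattern described for pop2:
 * many 512 B - 4 KB point-to-point messages between neighbouring ranks,
 * punctuated by periodic 8-byte MPI_Allreduce calls.  All sizes and
 * counts below are assumptions, not values taken from POP or SPEC. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_BYTES    2048   /* within the 512 B - 4 KB range described  */
#define STEPS        1000   /* assumed number of "timesteps"            */
#define REDUCE_EVERY 10     /* assumed period of the tiny 8-byte reduce */

int main(int argc, char **argv)
{
    int rank, size;
    char sendbuf[MSG_BYTES], recvbuf[MSG_BYTES];
    double local = 1.0, global = 0.0;   /* 8-byte payload for the allreduce */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(sendbuf, rank & 0xff, MSG_BYTES);

    int right = (rank + 1) % size;          /* simple ring of neighbours */
    int left  = (rank - 1 + size) % size;

    for (int step = 0; step < STEPS; step++) {
        /* Small/medium point-to-point exchange with both neighbours. */
        MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_BYTE, right, 0,
                     recvbuf, MSG_BYTES, MPI_BYTE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, MSG_BYTES, MPI_BYTE, left,  1,
                     recvbuf, MSG_BYTES, MPI_BYTE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Periodic tiny (8-byte) allreduce, e.g. a global diagnostic. */
        if (step % REDUCE_EVERY == 0)
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("done; last allreduce result = %g\n", global);

    MPI_Finalize();
    return 0;
}

Built and launched in the usual way (for example, mpicc pattern.c -o pattern and then mpirun -np 16 ./pattern, with a hypothetical file name), a pattern like this stresses per-message overhead and small-message collective latency rather than large-message bandwidth, which is the regime where Kevin argues InfiniPath does better than InfiniHost.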
