Off-line, someone asked me to clarify my earlier e-mail. Since this discussion continues, perhaps this will help explain the performance a bit more. The Max Payload Size quoted here is what is typically implemented on x86 chipsets; other chipsets may use a larger value. From a pure bandwidth perspective (which is not typical of many applications), this should be reasonably accurate. In any case, this is just an FYI.
An IB x4 link at 5 GT/s is 20 Gbps raw (customers should understand that the marketing number does not translate into bandwidth available to applications - I have had to explain to the press in the past that raw bandwidth does not equal application-available bandwidth). Take off the 8b/10b encoding, protocol overheads, etc., and assuming a 2KB PMTU, one can expect to hit perhaps 14-15 Gbps per direction depending upon the workload. Let's assume an aggregate of 30 Gbps of potential application bandwidth for simplicity. A PCIe x8 link at 2.5 GT/s is also 20 Gbps raw. Take off the 8b/10b encoding, protocol overheads, control / application overheads, etc., and note that it uses at most a 256B Max Payload Size on DMA Writes and cache-line-sized (64B) DMA Read Completions (though many people use PIO Writes to avoid DMA Reads in micro-benchmarks). The actual performance is therefore unlikely to reach what IB might drive, depending upon the direction and the mix of control and application data transactions. Add in the impact on the memory controller, which in real-world applications is servicing the processors far more than micro-benchmarks illustrate, and the ability of a system to drive an IB x4 DDR device at link rate is very questionable.
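For anyone who wants to play with the arithmetic, below is a rough Python sketch of the back-of-envelope math above. The per-packet overhead byte counts are my own assumptions for illustration, not exact values from either spec; real efficiency also depends on read requests, flow-control credits, and the control/data mix.

    # Rough sketch of the back-of-envelope math above. The overhead byte
    # counts are assumed approximations, not exact values from either spec.

    def effective_gbps(raw_gbps, payload_bytes, overhead_bytes):
        """Payload bandwidth after 8b/10b encoding and per-packet overhead."""
        data_rate = raw_gbps * 8.0 / 10.0                      # 8b/10b coding
        efficiency = payload_bytes / float(payload_bytes + overhead_bytes)
        return data_rate * efficiency

    # IB x4 DDR: 4 lanes * 5 Gbps = 20 Gbps raw, 2KB PMTU,
    # assuming ~42B of headers/CRCs (LRH + BTH + RETH + ICRC + VCRC).
    print(effective_gbps(20, 2048, 42))   # ~15.7 Gbps best case

    # PCIe x8 2.5 GT/s: 20 Gbps raw, 256B Max Payload DMA Writes,
    # assuming ~24B of TLP header plus link-layer framing/LCRC per packet.
    print(effective_gbps(20, 256, 24))    # ~14.6 Gbps for DMA Writes

    # 64B cache-line DMA Read Completions pay the same per-packet overhead.
    print(effective_gbps(20, 64, 24))     # ~11.6 Gbps for Read Completions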
The question is whether this really matters. If you examine most workloads on various platforms, they simply cannot generate enough bandwidth to consume the external I/O bandwidth capacity. In many cases they are constrained by the processor or by the combination of the processor and memory components. This isn't a bad thing when you think about it. For many customers, it means that the attached I/O fabrics will be sufficiently provisioned to eliminate or largely mitigate the impact of external fabric events, e.g. congestion, and deliver a reasonable solution using the existing hardware (issues of topology, use of multi-path, etc. all come into play as a function of fabric diameter). In the end, customers care about whether the application performs as expected and where the real bottlenecks lie. For most applications, it will come down to the processor / memory subsystems and not the I/O or the external fabric.
While I haven't seen all of the latest DDR micro-benchmark results, I believe the x4 IB SDR numbers largely align with what I've outlined here.
Mike
At 02:09 AM 10/6/2006, john t wrote:
Hi Shannon,
The bandwidth figures you quoted below match my readings for a single-port Mellanox DDR HCA (both unidirectional and bidirectional). So it seems a dual-port SDR HCA performs as well as a single-port DDR HCA. It would help if you could also tell me the bandwidth you got using just one port of your dual-port SDR HCA card. Was it half the bandwidth you stated below, which would mean that having two SDR ports per HCA helps?
In my case it seems having two DDR ports per HCA does not increase BW: the PCIe x8 limit is 16 Gb/sec per direction, and although each of the two DDR ports is capable of transferring 16 Gb/sec in each direction, when used together they cannot go above 16 Gb/sec.
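A minimal sketch of that reasoning, using nominal per-direction data rates (after 8b/10b) rather than measured figures:

    # Minimal sketch of the shared-slot reasoning above; nominal data
    # rates per direction (after 8b/10b), not measured numbers.
    PCIE_X8_GBPS = 16.0      # PCIe x8 at 2.5 GT/s, per direction
    IB_DDR_PORT_GBPS = 16.0  # one IB 4x DDR port, per direction

    def per_direction_bw(num_ports):
        # The ports share one PCIe x8 slot, so the slot caps the total.
        return min(num_ports * IB_DDR_PORT_GBPS, PCIE_X8_GBPS)

    print(per_direction_bw(1))  # 16.0 - one DDR port can already fill the slot
    print(per_direction_bw(2))  # 16.0 - a second port on the same HCA adds nothing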
Regards,
John T.
On 10/5/06, Shannon V. Davidson <[EMAIL PROTECTED] > wrote:
- John,
- In our testing with dual-port Mellanox SDR HCAs, we found that not all PCI Express implementations are equal. Depending on the PCIe chipset, we measured unidirectional SDR dual-rail bandwidth ranging from 1100-1500 MB/sec and bidirectional SDR dual-rail bandwidth ranging from 1570-2600 MB/sec. YMMV, but we had good luck with Intel and Nvidia chipsets, and less success with the Broadcom ServerWorks HT-1000 and HT-2000 chipsets. My last report (in June 2006) was that Broadcom was working to improve their PCI Express performance.
- Regards,
- Shannon
- john t wrote:
- Hi Bernard,
- I had a configuration issue. I fixed it, and now I get the same BW (i.e. around 10 Gb/sec) on each port provided I use ports on different HCA cards. If I use two ports of the same HCA card, then the BW gets divided between the two ports. I am using Mellanox HCA cards and doing simple send/recv using uverbs.
- Do you think it could be an issue with the Mellanox driver, or could it be due to a system/PCIe limitation?
- Regards,
- John T.
- On 10/3/06, Bernard King-Smith <[EMAIL PROTECTED] > wrote:
- John,
- Whose adapter (manufacturer) are you using? It is usually an adapter implementation or driver issue that occurs when you cannot scale across multiple links. The fact that you don't scale up from one link, but instead appear to share a fixed bandwidth across N links, suggests a driver or stack issue. At one time I think IPoIB, and maybe other IB drivers, used only one event queue across multiple links, which would be a bottleneck. We added code in the IBM eHCA driver to get around this bottleneck.
- Are your measurements using MPI or IP? Are you using separate tasks/sockets per link, and different subnets if using IP?
- Bernie King-Smith
- IBM Corporation
- Server Group
- Cluster System Performance
- [EMAIL PROTECTED] (845)433-8483
- Tie. 293-8483 or wombat2 on NOTES
- "We are not responsible for the world we are born into, only for the world we leave when we die.
- So we have to accept what has gone before us and work to change the only thing we can,
- -- The Future." William Shatner
- "john t" <[EMAIL PROTECTED]> wrote on 10/03/2006 09:42:24 AM:
- >
- > Hi,
- >
- > I have two HCA cards, each having two ports and each connected to a
- > separate PCI-E x8 slot.
- >
- > Using one HCA port I get an end-to-end BW of 11.6 Gb/sec (unidirectional RDMA).
- > If I use two ports of the same HCA or different HCA, I get between 5
- > to 6.5 Gb/sec point-to-point BW on each port. BW on each port
- > further reduces if I use more ports. I am not able to understand
- > this behaviour. Is there any limitation on max. BW that a system can
- > provide? Does the available BW get divided among multiple HCA ports
- > (which means having multiple ports will not increase the BW)?
- >
- >
- > Regards,
- > John T
--- ____________________________________________
- Shannon V. Davidson <[EMAIL PROTECTED]>
- Senior Software Engineer Raytheon
- 636-479-7465 office 443-383-0331 fax
- ____________________________________________
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
