On Tuesday 13 November 2007, Bill Johnstone wrote:
> I've been trying to quantify the performance differences between the
> cluster running on the previous switch vs. the new one. I've been
> using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode
> and also IOR. In the previous configuration, the 64-bit nodes had only
> a single connection to the switch, and the MTU was 1500. Under the new
> configuration, all nodes are now running with an MTU of 9000 and the
> 64-bit nodes with the tg3s are set up with the Linux bonding driver to
> form 802.3ad aggregated links using both ports per aggregate link.
> I've not adjusted any sysctls or driver settings. The e1000 driver is
> version 7.3.20-k2-NAPI as shipped with the Linux kernel.
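Before digging into tuning, it's worth a quick sanity check that the
configuration described above actually took effect on every node.
Something along these lines should do it; I'm assuming the aggregate
interface is called bond0 with eth0 and eth1 as its slaves, so adjust
the names to match your setup:

    # Bonding mode, 802.3ad/LACP partner info, and per-slave link state:
    cat /proc/net/bonding/bond0

    # The jumbo MTU should show up on the aggregate and on both slaves
    # (all three should report mtu 9000):
    ip link show bond0
    ip link show eth0
    ip link show eth1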
Looking at the master node of a Rocks cluster during mass rebuilds
(involving HTTP transfers), I can keep the output side of the master
node's GigE link saturated (123 MB/s a good bit of the time) with MTU
1500. I've never encountered a need to increase the MTU, but I've also
never done significant MPI over Ethernet (only Myrinet and IB). I don't
know whether an MPI load would be helped by MTU 9000, but I wouldn't
assume it would be without actually measuring it.

<snip>

> In trying to understand this, I noticed that ifconfig listed something
> like 2000 - 2500 dropped packets for the bonded interfaces on each
> node. This was following a pass of IMB-MPI1 and IMB-EXT. The dropped
> packet counts seem split roughly equally across the two bonded slave
> interfaces. Am I correct in taking this to mean the incoming load on
> the bonded interface was simply too high for the node to service all
> the packets? I can also note that I tried both "layer2" and "layer3+4"
> for the "xmit_hash_policy" bonding parameter, without any significant
> difference. The switch itself uses only a layer2-based hash.

I don't know exactly what causes the 'ifconfig' dropped-packet counter
to increment. I have seen syslog, using UDP to a central syslog server,
get saturated and drop packets. What I mean by that is: syslogd's
socket receive buffer was routinely filling up whenever there was a
deluge of messages from compute nodes. When there is no room in an
application's receive buffer for a new packet, the kernel drops the
packet, so some messages never made it into syslogd and therefore never
made it into the logfile on disk. I don't know whether that form of
packet drop increments ifconfig's dropped-packet counter.

When I looked into this a bit more, I discovered that syslogd uses the
default socket buffer sizes, so the only way to change that (short of a
one-line edit to syslogd's source and a rebuild, or an alternative to
ye olde syslogd) was to raise the kernel's default socket receive
buffer size in sysctl.conf:

    net.core.rmem_default = 8388608

This does not directly bear on your problem, but it might give you
something to think about.

> 1. What are general network/TCP tuning parameters, e.g. buffer sizes,
> etc. that I should change or experiment with? For older kernels, and
> especially with the 2.4 series, changing the socket buffer size was
> recommended. However, various pieces of documentation such as
> http://www.netapp.com/library/tr/3183.pdf indicate that the newer 2.6
> series kernels "auto-tune" these buffers. Is there still any benefit
> to manually adjusting them?

The standard ones to play with are:

    from /proc/sys/net/core:
        rmem_default
        wmem_default
        rmem_max
        wmem_max

    from /proc/sys/net/ipv4:
        tcp_rmem
        tcp_wmem

I'm guessing you already knew about all of those. :) UDP uses the
buffer sizes from core; TCP uses the ones in ipv4.

I looked at your NetApp URL and couldn't confidently identify where it
discusses "auto-tune". Perhaps it's talking about NFS (server or
client?) auto-tuning. Or perhaps the kernel doesn't auto-tune generally
for any old application, only for NFS? There was most definitely a need
to tune manually in my syslog example above, on RHEL4 2.6.9-* kernels.
A sketch of how the whole set looks in sysctl.conf follows.
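The numbers here are only illustrative starting points (rmem_default is
the value from my syslog episode), not recommendations; measure before
and after changing anything. Note that tcp_rmem and tcp_wmem each take
three values (min, default, max, in bytes), and 2.6 kernels adjust the
per-connection TCP buffers within those bounds:

    # /etc/sysctl.conf -- apply with `sysctl -p`
    # Default and maximum socket buffer sizes (bytes); UDP and other
    # non-TCP sockets get the core defaults.
    net.core.rmem_default = 8388608
    net.core.wmem_default = 8388608
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216

    # TCP per-socket buffers: min, default, max (bytes). rmem_max and
    # wmem_max above also cap what an application can request with
    # setsockopt(SO_RCVBUF/SO_SNDBUF).
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216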
> 2. For the e1000, using the Linux kernel version of the driver, what
> are the relevant tuning parameters, and what have been your experiences
> in trying various values? There are knobs for the interrupt throttling
> rate, etc. but I'm not sure where to start.

Gosh, I went through this once, but I don't have those results readily
available to me now. I assume you have already found the guide, I think
from Intel, that goes into great detail about these tuning parameters?

> 3. For the tg3, again, what are the relevant tuning parameters, and
> what have been your experiences in trying various values? I've found
> it more difficult to find discussions for the "tunables" for tg3 as
> compared to e1000.
>
> 4. What has been people's recent experience using the Linux kernel
> bonding driver to do 802.3ad link aggregation? What kind of throughput
> scaling have you folks seen, and what about processor load?

Can't help you on either of these.

> 5. What suggestions are there regarding trying to reduce the number of
> dropped packets?

Find the parameter, either in the kernel or in your application, that
controls your application's socket receive buffer size, and try
increasing it. A sketch of how I would first pin down where the drops
are being counted follows.
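Interface names are again assumptions (eth0 and eth1 under bond0); the
idea is to find out which counter is actually incrementing before
deciding which knob to raise:

    # Per-interface totals as the kernel sees them:
    cat /proc/net/dev

    # Driver-level statistics for each slave; look for counters along
    # the lines of rx_no_buffer, rx_missed, or rx_fifo errors:
    ethtool -S eth0
    ethtool -S eth1

    # Protocol-level drops; "packet receive errors" under Udp usually
    # means full socket receive buffers:
    netstat -su

    # If the drops are at the NIC/driver level, the receive ring size
    # may matter; -g shows current and maximum settings, -G changes them:
    ethtool -g eth0

If they turn out to be socket-buffer drops, the sysctl settings above
are the place to start; if they are ring or FIFO drops, ethtool -G and
the driver's interrupt-throttling knobs are more likely to matter.

David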
