Hello, I'm posting this as a follow-up to the private mail discussion I have already had with Sunay and Roamer.
I'll start with a description of what we are doing. We are working with 2-node cluster configurations of large SPARC servers, e.g. SunFire 6900 clusters, and are now moving to Maramba and M9000 clusters as the next hardware generation for our application. Currently we have an M9000 cluster in our labs for prototyping, equipped with 32 SPARC64-VI "Olympus" CPUs (2 cores x 2 threads) per node, i.e. 128 ways per node. As the cluster interconnect (provided by Sun Cluster) we currently use 4 physical 1 GBit links; later on we plan to go to 2 links with 10 GBit each. The driver used for these NICs is nxge. The throughput on all links of the cluster interconnect is roughly 100,000 pkts/sec per direction, and most of this load is generated by 8 processes with 1 connection each.

In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup the interrupt load on the 4 CPUs doing the interrupt handling for the 4 NICs became a bottleneck. We therefore fenced those 4 CPUs into a processor set (actually we fenced both strands of each core, since the strands share the interrupt logic), but the CPUs are still almost saturated with interrupt processing, although we have already tuned some parameters with ndd. Interrupt fencing has only been a work-around for the prototyping, though: in a real-world setup with our high HA requirements we won't be able to implement it with affordable effort (think of error situations where one CPU fails, interrupts move to different CPUs, and we have to re-bind all our processes, create new processor sets, and so on). So what we need is interrupt fanout to many CPUs, which would make interrupt fencing unnecessary.

What makes the situation more difficult is that we run many processes with rather large heaps on the system. In order to keep cache misses down and to scale on such a large 128-way server, we explicitly bind the application threads of these processes to *all* 128 virtual CPUs in the system (see the small binding sketch at the end of this post). Threads bound to CPUs that do heavy interrupt processing will starve. We therefore need interrupt processing spread over so many CPUs that even threads bound to interrupt CPUs still get enough CPU time.

What I've learned so far is that the e1000g driver cannot fan out interrupts to different CPUs, but nxge can. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver; we are currently re-installing our cluster with such a setup. I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow is available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable. We need some kind of work-around using fanout right now to evaluate whether this approach suits our needs, and we need a product solution by the end of this year, since our customers are already waiting for the new server generation. With 128 virtual CPUs in the system, I believe we need a fanout over at least 16 or 32 CPUs in order to get the interrupt processing load per CPU down far enough that we can still bind threads to these CPUs.

After the discussion with Roamer and Sunay, and after reading your document "Hardware Resources Management and Virtualization", I see the following problems for our setup:

- Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 CPUs, we would need at least 32 (probably better 64 or more) connections.
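To make problem #1 a bit more concrete, here is a stripped-down sketch of what "more connections" would mean for us on the sending node (my own illustration, not anything from Sunay's document; the peer address, port and worker count are just made-up placeholders): every worker gets its own TCP connection to the other node, so that an L3/L4 classifier would see one distinct flow per worker instead of a single flow.

/*
 * Sketch only: each worker thread opens its own TCP connection to the
 * peer node, so a full L3/L4 classifier would see NWORKERS distinct
 * flows instead of one.  PEER_ADDR, PEER_PORT and NWORKERS are made up.
 * Build on Solaris with: cc -mt -o fanout fanout.c -lsocket -lnsl
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NWORKERS  32                /* placeholder: one flow per target CPU */
#define PEER_ADDR "10.0.0.2"        /* placeholder interconnect address of the other node */
#define PEER_PORT 5001              /* placeholder service port */

static void *
worker(void *arg)
{
    char buf[1460];
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    (void) memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(PEER_PORT);
    (void) inet_pton(AF_INET, PEER_ADDR, &sin.sin_addr);

    if (connect(fd, (struct sockaddr *)&sin, sizeof (sin)) != 0) {
        perror("connect");
        return (NULL);
    }

    /*
     * Each worker sends its share of the interconnect traffic over its
     * private connection; the distinct local port makes it a distinct
     * flow.  (Endless traffic generator, just for the sketch.)
     */
    (void) memset(buf, (int)(uintptr_t)arg, sizeof (buf));
    while (write(fd, buf, sizeof (buf)) > 0)
        ;
    (void) close(fd);
    return (NULL);
}

int
main(void)
{
    pthread_t tid[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        (void) pthread_create(&tid[i], NULL, worker, (void *)(uintptr_t)i);
    for (i = 0; i < NWORKERS; i++)
        (void) pthread_join(tid[i], NULL);
    return (0);
}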
Sunay already pointed out that even now 3 or 4 CPUs are involved in moving a packet through the stack (interrupt/polling, soft ring, squeue). However, from our tests so far I believe that interrupt processing is what hurts us most. Setting ip_squeue_fanout and ip_soft_rings_cnt didn't help us, so I don't expect we will get by with 8 connections. I'm currently discussing with our platform developers how to best increase the number of connections.

- While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that, of the possible L3/L4 classifiers, nxge only uses the source IP address as the classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic: it is all inter-node traffic, always coming from the other node of the cluster! This means that no matter how many connections we use, they will always be mapped to the same CPU, because they all have the same source IP address. (Since we will use 2 or 4 NICs as the interconnect for reasons of redundancy, the traffic will be mapped to 2 or 4 CPUs -- but not to 16 or 32.)

We will need some kind of solution for this. Do you have any ideas what a solution could look like? For example, do you plan to extend nxge to also consider source and destination port as classifiers for fanout?

We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be that there is some kind of solution for problem #2.

Thanks for taking the time to read all this, and thanks for your support so far!

Nick.
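P.S. In case it helps to understand why the interrupt load hurts us so much: this is roughly how our processes spread and bind their threads over the virtual CPUs. It is a much simplified sketch of the idea, not our production code (the thread count and the simple round-robin over online CPUs are just placeholders; the real code also has to deal with processor sets, offline CPUs and re-binding after CPU failures).

/* Sketch only.  Build on Solaris with: cc -mt -o bindsketch bindsketch.c */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

#define NTHREADS 8      /* placeholder; our real processes use far more threads */

/*
 * Pick the next usable virtual CPU, simple round-robin over all possible
 * CPU ids (P_NOINTR CPUs still run threads, so they count as usable).
 */
static processorid_t
next_cpu(void)
{
    static processorid_t next = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    processorid_t max = (processorid_t)sysconf(_SC_CPUID_MAX);
    processorid_t cpu;
    int status;

    (void) pthread_mutex_lock(&lock);
    for (;;) {
        cpu = next;
        next = (next >= max) ? 0 : next + 1;
        status = p_online(cpu, P_STATUS);
        if (status == P_ONLINE || status == P_NOINTR)
            break;
    }
    (void) pthread_mutex_unlock(&lock);
    return (cpu);
}

static void *
worker(void *arg)
{
    processorid_t cpu = next_cpu();

    /*
     * Bind this LWP to one virtual CPU; from now on it competes with
     * whatever interrupt/softint load happens to run on that CPU.
     */
    if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
        perror("processor_bind");

    /* ... application work with its large heap would happen here ... */
    return (arg);
}

int
main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        (void) pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        (void) pthread_join(tid[i], NULL);
    return (0);
}

With this binding in place, a thread that lands on one of the interrupt CPUs competes directly with the interrupt and softint load there, which is exactly where the starvation comes from.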