Hello, I'm posting this as a follow-up to the private mail discussion I have already had with Sunay and Roamer.
I'll start with a description of what we are doing. We are working with 2-node cluster configurations of large SPARC servers, e.g. SunFire 6900 clusters, and are now moving to Maramba and M9000 clusters as the next hardware generation for our application. Currently we have an M9000 cluster in our labs for prototyping, equipped with 32 SPARC64-VI "Olympus" CPUs (2 cores x 2 threads) per node, i.e. 128 ways per node. As the cluster interconnect (provided by Sun Cluster) we currently use 4 physical 1 GBit links; later on we plan to go to 2 links with 10 GBit each. The driver used for these NICs is nxge. The throughput on all links of the cluster interconnect is roughly 100,000 pkts/sec per direction, and most of this load is generated by 8 processes with 1 connection each.

In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup the interrupt load on the 4 CPUs doing the interrupt handling for the 4 NICs became a bottleneck. We therefore fenced those 4 CPUs into a processor set (actually we fenced both strands of each core, since the strands share the interrupt logic), but the CPUs are still almost saturated with interrupt processing, although we have already tuned some parameters with ndd. Interrupt fencing has only been a work-around for the prototyping, though: in a real-world setup with our high HA requirements we won't be able to implement it with affordable effort (think of error situations where one CPU fails, interrupts move to different CPUs, and we have to re-bind all our processes, create new processor sets, and so on). So what we need is interrupt fanout to many CPUs, which would make interrupt fencing unnecessary.

What makes the situation more difficult is that we run many processes with rather large heaps on the system. In order to keep cache misses down and to scale on such a large 128-way server, we explicitly bind the application threads of these processes to *all* 128 virtual CPUs in the system (see the small binding sketch at the end of this post). Threads bound to CPUs that do heavy interrupt processing will starve. We therefore need interrupt processing spread over so many CPUs that even threads bound to interrupt CPUs still get enough CPU time.

What I've learned so far is that the e1000g driver cannot fan out interrupts to different CPUs, but nxge can. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver; we are currently re-installing our cluster with such a setup. I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow is available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable. We need some kind of work-around using fanout right now to evaluate whether this approach suits our needs, and we need a product solution by the end of this year, since our customers are already waiting for the new server generation. With 128 virtual CPUs in the system, I believe we need a fanout over at least 16 or 32 CPUs in order to get the interrupt processing load per CPU down far enough that we can still bind threads to these CPUs.

After the discussion with Roamer and Sunay, and after reading your document "Hardware Resources Management and Virtualization", I see the following problems for our setup:

- Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 CPUs, we would need at least 32 (probably better 64 or more) connections.
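To make problem #1 a bit more concrete, here is a stripped-down sketch of what "more connections" would mean for us on the sending node (my own illustration, not anything from Sunay's document; the peer address, port and worker count are just made-up placeholders): every worker gets its own TCP connection to the other node, so that an L3/L4 classifier would see one distinct flow per worker instead of a single flow.

/*
 * Sketch only: each worker thread opens its own TCP connection to the
 * peer node, so a full L3/L4 classifier would see NWORKERS distinct
 * flows instead of one.  PEER_ADDR, PEER_PORT and NWORKERS are made up.
 * Build on Solaris with: cc -mt -o fanout fanout.c -lsocket -lnsl
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NWORKERS  32                /* placeholder: one flow per target CPU */
#define PEER_ADDR "10.0.0.2"        /* placeholder interconnect address of the other node */
#define PEER_PORT 5001              /* placeholder service port */

static void *
worker(void *arg)
{
    char buf[1460];
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    (void) memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(PEER_PORT);
    (void) inet_pton(AF_INET, PEER_ADDR, &sin.sin_addr);

    if (connect(fd, (struct sockaddr *)&sin, sizeof (sin)) != 0) {
        perror("connect");
        return (NULL);
    }

    /*
     * Each worker sends its share of the interconnect traffic over its
     * private connection; the distinct local port makes it a distinct
     * flow.  (Endless traffic generator, just for the sketch.)
     */
    (void) memset(buf, (int)(uintptr_t)arg, sizeof (buf));
    while (write(fd, buf, sizeof (buf)) > 0)
        ;
    (void) close(fd);
    return (NULL);
}

int
main(void)
{
    pthread_t tid[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        (void) pthread_create(&tid[i], NULL, worker, (void *)(uintptr_t)i);
    for (i = 0; i < NWORKERS; i++)
        (void) pthread_join(tid[i], NULL);
    return (0);
}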
Sunay already pointed out that even now 3 or 4 CPUs are involved in moving a packet through the stack (interrupt/polling, soft ring, squeue). However, from our tests so far I believe that interrupt processing is what hurts us most. Setting ip_squeue_fanout and ip_soft_rings_cnt didn't help us, so I don't expect we will get by with 8 connections. I'm currently discussing with our platform developers how to best increase the number of connections.

- While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that, of the possible L3/L4 classifiers, nxge only uses the source IP address as the classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic: it is all inter-node traffic, always coming from the other node of the cluster! This means that no matter how many connections we use, they will always be mapped to the same CPU, because they all have the same source IP address. (Since we will use 2 or 4 NICs as the interconnect for reasons of redundancy, the traffic will be mapped to 2 or 4 CPUs -- but not to 16 or 32.)

We will need some kind of solution for this. Do you have any ideas what a solution could look like? For example, do you plan to extend nxge to also consider source and destination port as classifiers for fanout?

We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be that there is some kind of solution for problem #2.

Thanks for taking the time to read all this, and thanks for your support so far!

Nick.
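P.S. In case it helps to understand why the interrupt load hurts us so much: this is roughly how our processes spread and bind their threads over the virtual CPUs. It is a much simplified sketch of the idea, not our production code (the thread count and the simple round-robin over online CPUs are just placeholders; the real code also has to deal with processor sets, offline CPUs and re-binding after CPU failures).

/* Sketch only.  Build on Solaris with: cc -mt -o bindsketch bindsketch.c */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

#define NTHREADS 8      /* placeholder; our real processes use far more threads */

/*
 * Pick the next usable virtual CPU, simple round-robin over all possible
 * CPU ids (P_NOINTR CPUs still run threads, so they count as usable).
 */
static processorid_t
next_cpu(void)
{
    static processorid_t next = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    processorid_t max = (processorid_t)sysconf(_SC_CPUID_MAX);
    processorid_t cpu;
    int status;

    (void) pthread_mutex_lock(&lock);
    for (;;) {
        cpu = next;
        next = (next >= max) ? 0 : next + 1;
        status = p_online(cpu, P_STATUS);
        if (status == P_ONLINE || status == P_NOINTR)
            break;
    }
    (void) pthread_mutex_unlock(&lock);
    return (cpu);
}

static void *
worker(void *arg)
{
    processorid_t cpu = next_cpu();

    /*
     * Bind this LWP to one virtual CPU; from now on it competes with
     * whatever interrupt/softint load happens to run on that CPU.
     */
    if (processor_bind(P_LWPID, P_MYID, cpu, NULL) != 0)
        perror("processor_bind");

    /* ... application work with its large heap would happen here ... */
    return (arg);
}

int
main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        (void) pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        (void) pthread_join(tid[i], NULL);
    return (0);
}

With this binding in place, a thread that lands on one of the interrupt CPUs competes directly with the interrupt and softint load there, which is exactly where the starvation comes from.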