http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
NAPI ("New API") is a modification to the device driver packet processing framework, which is designed to improve the performance of high-speed networking. NAPI works through:
NAPI was first incorporated in the 2.5/2.6 kernel but was also backported to the 2.4.20 kernel. Note that use of NAPI is entirely optional; drivers will work just fine (though perhaps a little more slowly) without it.
The following is a whirlwind tour of what must be done to create a NAPI-compliant network driver.
The first step is to make some changes to your driver's interrupt handler. If your driver has been interrupted because a new packet is available, that packet should not be processed at that time. Instead, your driver should disable any further "packet available" interrupts and tell the networking subsystem to poll your driver shortly to pick up all available packets. Disabling interrupts, of course, is a hardware-specific matter between the driver and the adaptor. Arranging for polling is done with a call to:
    void netif_rx_schedule(struct net_device *dev);
An alternative form you'll see in some drivers is:
    if (netif_rx_schedule_prep(dev))
        __netif_rx_schedule(dev);
The end result is the same either way. (If netif_rx_schedule_prep() returns zero, it means that there was already a poll scheduled, and you should not have received another interrupt.)
The next step is to create a poll() method for your driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:
    int (*poll)(struct net_device *dev, int *budget);
The poll() function should process all available incoming packets, much as your interrupt handler might have done in the pre-NAPI days. There are some exceptions, however:
int netif_receive_skb(struct sk_buff *skb);
void netif_rx_complete(struct net_device *dev);
The networking subsystem promises that poll() will not be invoked simultaneously (for the same device) on multiple processors.
The final step is to tell the networking subsystem about your poll() method. This, of course, is done in your initialization code when all the other struct net_device fields are set:
    dev->poll = my_poll;
    dev->weight = 16;
The weight field is a measure of the importance of this interface; the number stored here will turn out to be the same number your driver finds in the quota field when poll() is called. If you forget to initialize weight and leave it at zero, poll() will never be called (voice of experience here). Gigabit adaptor drivers tend to set weight to 64; smaller values can be used for slower media.
NAPI, however, requires the following features to be available:
NAPI processes packet events in what is known as the dev->poll() method. Typically, only packet receive events are processed in dev->poll(). The rest of the events MAY be processed by the regular interrupt handler to reduce processing latency (justified also because there are not that many of them). Note, however, that NAPI does not enforce that dev->poll() only processes receive events. The example used in this document moves only the receive processing to dev->poll(); this is shown with the patch for the tulip driver. For examples of code that moves all interrupt-driven event processing to dev->poll(), look at other drivers (tg3, e1000, sky2).
There are caveats that might force you to move everything to dev->poll(). Different NICs work differently depending on their status/event acknowledgement setup. There are two types of event register ACK mechanisms.
Can't seem to find any supported by Linux which do this.
This is a very important topic and appendix 2 is dedicated to further discussion.
For the rest of this text, we'll assume that dev->poll() only processes receive events.
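Under that assumption, the three driver steps described earlier (interrupt handler, poll() method, registration) can be modelled in user space. The sketch below is not kernel code: struct net_device, netif_rx_schedule(), netif_rx_complete() and net_rx_action() are simplified stand-ins written for this example, the rx_ring and delivered counters are hypothetical, and the budget/quota bookkeeping and the 0/1 return convention are assumptions about the old poll() interface rather than something stated above.

```c
/* User-space model of the NAPI receive path; NOT kernel code.
 * Every type and helper body here is a simplified stand-in (assumption). */
#include <stdbool.h>

struct net_device {
    int (*poll)(struct net_device *dev, int *budget);
    int weight;          /* per-interface share, copied into quota */
    int quota;           /* packets this device may process per poll() call */
    bool poll_scheduled; /* stand-in for the kernel's "poll scheduled" state */
    bool rx_irq_enabled; /* stand-in for the NIC's rx interrupt mask */
    int rx_ring;         /* packets waiting in the mock receive ring */
    int delivered;       /* packets handed to the mock network stack */
};

/* Mock of netif_rx_schedule(): put the device on the poll list once. */
static void netif_rx_schedule(struct net_device *dev)
{
    if (!dev->poll_scheduled) {
        dev->poll_scheduled = true;
        dev->quota = dev->weight;
    }
}

/* Mock of netif_rx_complete(): take the device off the poll list. */
static void netif_rx_complete(struct net_device *dev)
{
    dev->poll_scheduled = false;
}

/* Interrupt handler: no packet processing, just mask rx irqs and schedule. */
static void my_interrupt(struct net_device *dev)
{
    dev->rx_irq_enabled = false;
    netif_rx_schedule(dev);
}

/* poll(): process at most min(*budget, dev->quota) packets.
 * Returns 0 when the ring is drained, 1 when work remains. */
static int my_poll(struct net_device *dev, int *budget)
{
    int limit = *budget < dev->quota ? *budget : dev->quota;
    int done = 0;

    while (dev->rx_ring > 0 && done < limit) {
        dev->rx_ring--;              /* stand-in for netif_receive_skb(skb) */
        dev->delivered++;
        done++;
    }
    *budget -= done;
    dev->quota -= done;

    if (dev->rx_ring == 0) {
        netif_rx_complete(dev);
        dev->rx_irq_enabled = true;  /* re-enable "packet available" irqs */
        return 0;
    }
    return 1;
}

/* Mock of the softirq side: keep polling a scheduled device within budget.
 * A zero weight means the device is never polled. */
static void net_rx_action(struct net_device *dev, int budget)
{
    while (dev->poll_scheduled && budget > 0 && dev->weight > 0) {
        if (dev->poll(dev, &budget) == 0)
            break;                   /* ring drained, device unscheduled */
        dev->quota = dev->weight;    /* more work: refill quota, poll again */
    }
}
```

To exercise the model, fill in poll and weight as in the registration step, set rx_ring to some packet count, call my_interrupt() once and then net_rx_action() with a budget. With enough budget the ring drains and rx interrupts come back on; with too little, the device stays scheduled with interrupts still masked.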
NAPI provides an "inherent mitigation" which is bound by system capacity, as can be seen from the following data collected by Robert Olsson's tests on Gigabit Ethernet (e1000):
Legend:
Observe that when the NIC receives 890 Kpackets/sec, only 17 rx interrupts are generated. The system can't handle the processing at 1 interrupt/packet at that load level. At lower rates, on the other hand, rx interrupts go up and therefore the interrupt/packet ratio goes up (as observable from that table). So there is a possibility that under low enough input, you get one poll call for each input packet, caused by a single interrupt each time. And if the system can't handle an interrupt/packet ratio of 1, then it will just have to chug along.
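The effect can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a reproduction of the measurements: time advances in discrete ticks, the ring is unbounded (a real NIC would drop packets instead), and simulate(), struct sim_result and their parameters are hypothetical names invented for this example.

```c
/* Toy model of NAPI's "inherent mitigation"; NOT kernel code.
 * All names are hypothetical and the numbers are illustrative only. */
#include <stdbool.h>

struct sim_result {
    long packets;     /* packets received by the mock NIC */
    long interrupts;  /* rx interrupts raised */
};

/* arrivals_per_tick: offered load.
 * capacity_per_tick: packets polling can process per tick
 *                    (stands in for "system capacity"). */
static struct sim_result simulate(int arrivals_per_tick,
                                  int capacity_per_tick, int ticks)
{
    struct sim_result r = {0, 0};
    long ring = 0;            /* unbounded here; real hardware would drop */
    bool irq_enabled = true;

    for (int t = 0; t < ticks; t++) {
        ring += arrivals_per_tick;
        r.packets += arrivals_per_tick;

        /* An rx interrupt fires only while rx interrupts are unmasked. */
        if (irq_enabled && ring > 0) {
            r.interrupts++;
            irq_enabled = false;   /* handler masks rx irqs, schedules poll */
        }

        /* Polling continues for as long as the device stays scheduled. */
        if (!irq_enabled) {
            long done = ring < capacity_per_tick ? ring : capacity_per_tick;
            ring -= done;
            if (ring == 0)
                irq_enabled = true;  /* ring drained: leave polling mode */
        }
    }
    return r;
}
```

In this model, simulate(1, 10, 1000) raises one interrupt per packet because every poll drains the ring, while simulate(20, 10, 1000) raises a single interrupt for the whole run because the ring never empties and the device never leaves polling mode.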
NAPI usage does not have to be limited only to receiving packets. With many devices the poll() routine can also be used to manage transmit completion or PHY interface state changes. By moving this processing out of the hardware interrupt service routine, there may be less latency and better performance.
Most chips with flow control only send a pause packet when they run out of Rx buffers. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packet storm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply in the case when the system can't pull out packets fast enough, i.e., send a pause only when you run out of rx buffers.
There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.
In some cases, NAPI may introduce additional software IRQ latency. On some devices, changing the IRQ mask may be a slow operation, or require additional locking. This overhead may negate any performance benefits observed with NAPI.
IRQ race a.k.a. rotting packet
There are two common race issues that a driver may have to deal with. These are cases where it is possible to cause the receiver to stop because of hardware and logic interaction.
IRQ mask and level-triggered
If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level-triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.
If we take the example of the tulip: "pending work" is indicated by the status bit (CSR5 in tulip). If we cleared the rx ring and proclaimed there was "no more work to be done" and then went on to do a few other things, then when we enable interrupts, there is a possibility that a new packet might sneak in during this phase. It helps to look at the pseudo code for the tulip poll routine:
    do {
        ACK;
        while (ring_is_not_empty()) {
            work-work-work
            if quota is exceeded: exit, no touching irq status/mask
        }
        /* No packets, but new can arrive while we are doing this */
        CSR5 := read
        if (CSR5 is not set) {
            /* If something arrives in this narrow window here,
             * where the comments are ;-> irq will be generated */
            unmask irqs;
            exit poll;
        }
    } while (rx_status_is_set);
The CSR5 bit of interest is only the rx status. Look at the last if statement: you have just finished grabbing all the packets from the rx ring, you check whether the status bit says more packets have just come in, and it says none; you then enable rx interrupts again. If a new packet came in during this check, we are counting on CSR5 being set in that small window of opportunity, so that re-enabling interrupts actually triggers an interrupt to register the new packet for processing.
non-level sensitive IRQs
Some systems have hardware that does not do level-triggered IRQs properly. Normally, IRQs may be lost while being masked, and the only way to leave poll is to do a double check for new input after netif_rx_complete() is invoked and re-enable polling (after seeing this new input).
        .
        .
    restart_poll:
        while (ring_is_not_empty()) {
            work-work-work
            if quota is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        netif_rx_complete(dev);
        if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
            disable_rx_and_rxnobufs()
            goto restart_poll
        }
As seen, NAPI moves processing to softirq level.
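The double check for non-level-sensitive hardware can be sketched as a user-space model. This is not kernel code and makes several assumptions: struct net_device and the bodies of netif_rx_complete() and netif_rx_reschedule() are simplified stand-ins, the quota give-back that the undo argument stands for is not modelled, and late_arrivals is a hypothetical test hook that injects a packet into the race window.

```c
/* User-space model of the "double check" exit path; NOT kernel code.
 * Types, fields and helper bodies are simplified stand-ins (assumptions). */
#include <stdbool.h>

struct net_device {
    bool poll_scheduled;  /* stand-in for the kernel's "poll scheduled" state */
    bool rx_irq_enabled;  /* stand-in for the NIC's rx interrupt mask */
    int rx_ring;          /* packets waiting in the mock receive ring */
    int delivered;        /* packets handed to the mock network stack */
    int late_arrivals;    /* hypothetical test hook, see my_poll() */
};

static void netif_rx_complete(struct net_device *dev)
{
    dev->poll_scheduled = false;
}

/* Mock: succeeds (returns 1) only if the device was not already scheduled.
 * The quota give-back that `undo` stands for is not modelled here. */
static int netif_rx_reschedule(struct net_device *dev, int undo)
{
    (void)undo;
    if (dev->poll_scheduled)
        return 0;
    dev->poll_scheduled = true;
    return 1;
}

static int my_poll(struct net_device *dev, int *budget)
{
    int limit = *budget;
    int received = 0;

restart_poll:
    while (dev->rx_ring > 0) {
        if (received >= limit) {
            *budget -= received;
            return 1;          /* quota exceeded: irq mask left untouched */
        }
        dev->rx_ring--;        /* stand-in for netif_receive_skb(skb) */
        dev->delivered++;
        received++;
    }

    /* Race window: a packet lands while rx irqs are still masked, and this
     * (non-level-sensitive) hardware never raises an irq for it. */
    if (dev->late_arrivals > 0) {
        dev->late_arrivals--;
        dev->rx_ring++;
    }

    dev->rx_irq_enabled = true;
    netif_rx_complete(dev);

    /* Double check: without it the late packet would rot in the ring. */
    if (dev->rx_ring > 0 && netif_rx_reschedule(dev, received)) {
        dev->rx_irq_enabled = false;
        goto restart_poll;
    }

    *budget -= received;
    return 0;
}
```

Starting from a scheduled device with three packets in the ring and one late arrival, a single my_poll() call delivers all four packets and leaves the ring empty with rx interrupts enabled. Removing the double check in this model would leave the fourth packet stranded until some later interrupt arrived.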
Linux uses ksoftirqd as the general solution to schedule softirqs to run before the next interrupt, by putting them under scheduler control. This also prevents consecutive softirqs from monopolizing the CPU. As a consequence, the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, to get the proper softirq/user balance. Increasing the ksoftirqd priority to 0 (eventually more) is reported to cure problems with low network performance at high CPU load.
Most used processes in a GIGE router:
    USER  PID %CPU %MEM  SIZE   RSS TTY STAT START    TIME COMMAND
    root    3  0.2  0.0     0     0 ?   RWN  Aug 15 602:00 (ksoftirqd_CPU0)
    root  232  0.0  7.9 41400 40884 ?   S    Aug 15  74:12 gated