http://www.linuxfoundation.org/collaborate/workgroups/networking/napi

NAPI ("New API") is a modification to the device driver packet processing framework, which is designed to improve the performance of high-speed networking. NAPI works through:

Interrupt mitigation 
High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
Packet throttling 
When the system is overwhelmed and must drop packets, it's better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

NAPI was first incorporated in the 2.5/2.6 kernel but was also backported to the 2.4.20 kernel.

Note that use of NAPI is entirely optional; drivers will work just fine (though perhaps a little more slowly) without it.
A driver may continue using the old 2.4 technique for interfacing to the network stack and not benefit from the NAPI changes. NAPI additions to the kernel do not break backward compatibility.


NAPI Driver design

The following is a whirlwind tour of what must be done to create a NAPI-compliant network driver.

The first step is to make some changes to your driver's interrupt handler. If your driver has been interrupted because a new packet is available, that packet should not be processed at that time. Instead, your driver should disable any further "packet available" interrupts and tell the networking subsystem to poll your driver shortly to pick up all available packets. Disabling interrupts, of course, is a hardware-specific matter between the driver and the adaptor. Arranging for polling is done with a call to:

   void netif_rx_schedule(struct net_device *dev);

An alternative form you'll see in some drivers is:

   if (netif_rx_schedule_prep(dev))
       __netif_rx_schedule(dev);

The end result is the same either way. (If netif_rx_schedule_prep() returns zero, it means that there was already a poll scheduled, and you should not have received another interrupt.)
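
The interrupt-handler step can be sketched as a small userspace model in plain C. This is only a sketch under stated assumptions, not driver code: struct mock_dev and the mock_* functions are hypothetical stand-ins for the device state and for netif_rx_schedule_prep() and __netif_rx_schedule(), and the model ignores whether the device is up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct mock_dev {
    bool rx_irq_enabled;   /* models the adaptor's "packet available" interrupt */
    bool poll_scheduled;   /* models the device being on the poll list */
    int  schedule_calls;   /* how many times a poll was actually queued */
};

/* Models netif_rx_schedule_prep(): returns 1 only if no poll is pending. */
static int mock_rx_schedule_prep(struct mock_dev *dev)
{
    if (dev->poll_scheduled)
        return 0;
    dev->poll_scheduled = true;
    return 1;
}

/* Models __netif_rx_schedule(): queue the device for polling. */
static void mock_rx_schedule(struct mock_dev *dev)
{
    assert(dev->poll_scheduled);   /* prep must have succeeded first */
    dev->schedule_calls++;
}

/* Interrupt handler for a "packet available" event. */
static void mock_rx_interrupt(struct mock_dev *dev)
{
    if (mock_rx_schedule_prep(dev)) {
        dev->rx_irq_enabled = false;   /* hardware-specific in a real driver */
        mock_rx_schedule(dev);
    }
    /* No packet is touched here; poll() picks them up later. */
}
```

Calling mock_rx_interrupt() twice in a row queues only one poll in this model, which mirrors the "already scheduled" case described above.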

The next step is to create a poll() method for your driver; its job is to obtain packets from the network interface and feed them into the kernel. The poll() prototype is:

   int (*poll)(struct net_device *dev, int *budget);

The poll() function should process all available incoming packets, much as your interrupt handler might have done in the pre-NAPI days. There are some exceptions, however:

  • Packets should not be passed to netif_rx(); instead, use:
   int netif_receive_skb(struct sk_buff *skb);

  • A new struct net_device field called quota contains the maximum number of packets that the networking subsystem is prepared to receive from your driver at this time. Once you have exhausted that quota, no further packets should be fed to the kernel in this poll() call.
  • The budget parameter also places a limit on the number of packets which your driver may process. Whichever of budget and quota is lower is the real limit.
  • Your driver should decrement dev->quota by the number of packets it processed. The value pointed to by the budget parameter should also be decremented by the same amount.
  • If packets remain to be processed (i.e. the driver used its entire quota), poll() should return a value of one.
  • If, instead, all packets have been processed, your driver should reenable interrupts, turn off polling, and return zero. Polling is stopped with:
   void netif_rx_complete(struct net_device *dev);

The networking subsystem promises that poll() will not be invoked simultaneously (for the same device) on multiple processors.
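
The quota and budget rules above can be sketched as a userspace model. This is a minimal sketch, assuming a simulated rx ring that is just a packet counter: struct mock_dev and mock_poll() are hypothetical names, and the comments mark where a real driver would call netif_receive_skb() and netif_rx_complete().

```c
#include <assert.h>

/* Hypothetical userspace model; field names mirror the text, not kernel headers. */
struct mock_dev {
    int quota;            /* packets the stack will accept from this device now */
    int ring_pending;     /* packets waiting in the simulated rx ring */
    int delivered;        /* packets handed to the stack so far */
    int rx_irq_enabled;   /* 1 when "packet available" interrupts are on */
    int on_poll_list;     /* 1 while polling is scheduled */
};

/* Returns 1 if packets remain, 0 if the ring was drained and polling stopped. */
static int mock_poll(struct mock_dev *dev, int *budget)
{
    /* Whichever of budget and quota is lower is the real limit. */
    int limit = (*budget < dev->quota) ? *budget : dev->quota;
    int done = 0;

    assert(dev->on_poll_list);

    while (done < limit && dev->ring_pending > 0) {
        dev->ring_pending--;    /* take one packet off the ring */
        dev->delivered++;       /* stand-in for netif_receive_skb(skb) */
        done++;
    }

    /* Both counters drop by the number of packets processed. */
    dev->quota -= done;
    *budget -= done;

    if (dev->ring_pending > 0)
        return 1;               /* work remains: stay on the poll list */

    dev->on_poll_list = 0;      /* stand-in for netif_rx_complete(dev) */
    dev->rx_irq_enabled = 1;    /* re-enable "packet available" interrupts */
    return 0;
}
```

With a quota of 16 and 20 packets pending, the first call in this model delivers 16 packets and returns one; a second call with a fresh quota delivers the remaining 4, re-enables interrupts and returns zero.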

The final step is to tell the networking subsystem about your poll() method. This, of course, is done in your initialization code when all the other struct net_device fields are set:

   dev->poll = my_poll;
   dev->weight = 16;

The weight field is a measure of the importance of this interface; the number stored here will turn out to be the same number your driver finds in the quota field when poll() is called. If you forget to initialize weight and leave it at zero, poll() will never be called (voice of experience here). Gigabit adaptor drivers tend to set weight to 64; smaller values can be used for slower media.


Hardware Architecture

NAPI requires the following features to be available:

  • DMA ring or enough RAM to store packets in software devices.
  • Ability to turn off interrupts or maybe events that send packets up the stack.

NAPI processes packet events in what is known as the dev->poll() method. Typically, only packet receive events are processed in dev->poll(). The rest of the events MAY be processed by the regular interrupt handler to reduce processing latency (justified also because there are not that many of them).

Note, however, NAPI does not enforce that dev->poll() only processes receive events.
Tests with the tulip driver indicated slightly increased latency if all of the interrupt handler is moved to dev->poll(). Also MII/PHY handling gets a little trickier.

The example used in this document is to move the receive processing only to dev->poll(); this is shown with the patch for the tulip driver. For an example of code that moves all of the interrupt handling to dev->poll() look at other drivers (tg3, e1000, sky2). There are caveats that might force you to go with moving everything to dev->poll(). Different NICs work differently depending on their status/event acknowledgement setup.

There are two types of event register ACK mechanisms.

  • Clear-on-read (COR). When you read the status/event register, it clears everything! The natsemi and sunbmac NICs are known to do this. In this case your only choice is to move all to dev->poll().
  • Clear-on-write (COW)
    • You clear the status by writing a 1 in the bit location you want. These are the majority of the NICs and work the best with NAPI. Put only receive events in dev->poll(); leave the rest in the old interrupt handler.
    • Whatever you write in the status register clears everything. There do not seem to be any NICs supported by Linux which do this.
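
The difference between the two ACK styles can be sketched with a simulated status register. This is an illustration only: the bit layout (EV_RX, EV_TX_DONE, EV_LINK), struct mock_nic and the helper functions are all hypothetical, and a real driver would go through its MMIO accessors instead of a plain struct field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this illustration. */
#define EV_RX      (1u << 0)   /* packet received */
#define EV_TX_DONE (1u << 1)   /* transmit completed */
#define EV_LINK    (1u << 2)   /* link state changed */

/* Simulated status register. */
struct mock_nic {
    uint32_t status;
};

/* Clear-on-read: the read returns every pending event and wipes them all,
 * so whoever reads must handle all of them. */
static uint32_t cor_read_status(struct mock_nic *nic)
{
    uint32_t events = nic->status;
    nic->status = 0;
    return events;
}

/* Clear-on-write, write-1-to-clear flavour: only bits set in 'ack' are cleared. */
static void w1c_write_status(struct mock_nic *nic, uint32_t ack)
{
    nic->status &= ~ack;
}

/* With write-1-to-clear, poll() can acknowledge rx alone and leave the
 * other events pending for the regular interrupt handler. */
static void ack_rx_only(struct mock_nic *nic)
{
    w1c_write_status(nic, EV_RX);
    assert((nic->status & EV_RX) == 0);   /* holds in this simulation */
}
```

In this model ack_rx_only() leaves EV_TX_DONE and EV_LINK untouched, while cor_read_status() consumes them together with EV_RX, which is why clear-on-read hardware pushes all event handling into dev->poll().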

  • Ability to detect new work correctly. NAPI works by shutting down event interrupts when there's work and turning them on when there's none. A new packet might sneak in during the small window while interrupts are being re-enabled. We would only get to know about such a packet when the next new packet arrives and generates an interrupt. For clarity we'll refer to this race condition as the "rotting packet".

This is a very important topic; it is discussed in more detail under "IRQ race a.k.a rotting packet" below.


Locking rules and environmental guarantees

  • Only one CPU at any time can call dev->poll(); this is because only one CPU can pick the initial interrupt and hence the initial netif_rx_schedule(dev).
  • The core layer invokes devices to send packets in a round robin format. This implies receive is totally lockless because of the guarantee that only one CPU is executing it.
  • Contention can only be the result of some other CPU accessing the rx ring. This happens only in close() and suspend() (when these methods try to clean the rx ring). Driver authors need not worry about this; synchronization is taken care of for them by the top net layer.
  • Local interrupts are enabled (if you don't move all to dev->poll()). For example, link/MII and txcomplete continue functioning just the same old way. This improves the latency of processing these events. It is also assumed that the receive interrupt is the largest cause of noise. Note this might not always be true. For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only processes receive events.


NAPI API

netif_rx_schedule(dev) 
Called by an IRQ handler to schedule a poll for the device.
netif_rx_schedule_prep(dev) 
Puts the device in a state ready to be added to the CPU polling list if it is up and running. You can look at this as the first half of netif_rx_schedule(dev).
__netif_rx_schedule(dev) 
Adds the device to the poll list for this CPU, assuming that netif_rx_schedule_prep(dev) has already been called and returned 1.
__netif_rx_schedule_prep(dev) 
Similar to netif_rx_schedule_prep(dev) but with no check that the device is up; usually not used.
netif_rx_reschedule(dev, undo) 
Called to reschedule polling for the device, specifically for some deficient hardware.
netif_rx_complete(dev) 
Removes the interface from the CPU poll list; it must be in the poll list on the current CPU. This primitive is called by dev->poll() when it completes its work. The device cannot be out of the poll list at this call; if it is, then clearly it is a BUG().
__netif_rx_complete(dev) 
Same as netif_rx_complete() but called when local interrupts are already disabled.


Advantages



Performance under high packet load

NAPI provides an "inherent mitigation" which is bound by system capacity as can be seen from the following data collected by Robert Olsson's tests on Gigabit ethernet (e1000):

Psize    Ipps     Tput   Rxint  Txint    Done  Ndone
   60  890000   409362      17  27622       7   6823
  128  758150   464364      21   9301      10   7738
  256  445632   774646      42  15507      21  12906
  512  232666   994445  241292  19147  241192   1062
 1024  119061  1000003  872519  19258  872511      0
 1440   85193  1000003  946576  19505  946569      0

Legend:

Ipps 
input packets per second
Tput 
packets out of total 1M that made it out
Txint 
transmit completion interrupts seen
Done 
The number of times that poll() managed to pull all packets out of the rx ring. Note from this that the lower the load, the more often we could clean up the rx ring.
Ndone 
The converse of "Done". Note again that the higher the load, the more times we couldn't clean up the rx ring.

Observe that when the NIC receives 890K packets/sec only 17 rx interrupts are generated. The system can't handle the processing at 1 interrupt/packet at that load level. At lower rates, on the other hand, rx interrupts go up and therefore the interrupt/packet ratio goes up (as observable from the table). So there is a possibility that under low enough input, you get one poll call for each input packet, each caused by a single interrupt. And if the system can't handle an interrupt per packet ratio of 1, then it will just have to chug along.
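
The interrupt/packet ratio can be read off the table with a one-line calculation. The sketch below assumes that each run offered 1,000,000 packets, which is only inferred from the "total 1M" wording in the Tput legend; PACKETS_PER_RUN and rx_irqs_per_packet() are hypothetical names.

```c
#include <assert.h>

/* Assumption: every run offered this many packets ("total 1M" in the legend). */
#define PACKETS_PER_RUN 1000000L

/* Average rx interrupts per offered packet for one table row. */
static double rx_irqs_per_packet(long rxint)
{
    assert(rxint >= 0);
    return (double)rxint / (double)PACKETS_PER_RUN;
}
```

Under that assumption the 60-byte row (Rxint = 17) works out to 0.000017 interrupts per packet, while the 1440-byte row (Rxint = 946576) is close to one interrupt per packet.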


Use of softirq for other optimizations

NAPI usage does not have to be limited only to receiving packets. With many devices the poll() routine can also be used to manage transmit completion or PHY interface state changes. By moving this processing out of the hardware interrupt service routine, there may be less latency and better performance.


Hardware Flow control

Most chips with flow control only send a pause packet when they run out of rx buffers. Since packets are pulled off the DMA ring by a softirq in NAPI, if the system is slow in grabbing them and we have a high input rate (faster than the system's capacity to remove packets), then theoretically there will only be one rx interrupt for all packets during a given packet storm. Under low load, we might have a single interrupt per packet. Flow control should be programmed to apply in the case when the system can't pull out packets fast enough, i.e. send a pause only when you run out of rx buffers.

There are some tradeoffs with hardware flow control. If the driver makes receive buffers available to the hardware one by one, then under load up to 50% of the packets can end up being flow control packets. Flow control works better if the hardware is notified about buffers in larger bursts.


Disadvantages



Latency

In some cases, NAPI may introduce additional software IRQ latency.


IRQ masking

On some devices, changing the IRQ mask may be a slow operation, or require additional locking. This overhead may negate any performance benefits observed with NAPI.


Issues



IRQ race a.k.a rotting packet

There are two common race issues that a driver may have to deal with. These are cases where it is possible to cause the receiver to stop because of hardware and logic interaction.


IRQ mask and level-triggered

If a status bit for receive or rxnobuff is set and the corresponding interrupt-enable bit is not on, then no interrupts will be generated. However, as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is generated (assuming the status bit was not turned off). Generally the concept of level triggered IRQs in association with a status and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip: "pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but the CSR5 will continue to be turned on with new packet arrivals even if we clear it the first time). Very important is the fact that if we turn on the interrupt bit when status is set, then an immediate irq is triggered.

If we cleared the rx ring, proclaimed there was "no more work to be done" and then went on to do a few other things, then by the time we enable interrupts there is a possibility that a new packet has sneaked in. It helps to look at the pseudo code for the tulip poll routine:

         do {
                 ACK;
                 while (ring_is_not_empty()) {
                         work-work-work
                         if quota is exceeded: exit, no touching irq status/mask
                 }
                 /* No packets, but new can arrive while we are doing this */
                 CSR5 := read
                 if (CSR5 is not set) {
                         /* If something arrives in this narrow window here,
                          * where the comments are ;-> irq will be generated */
                         unmask irqs;
                         exit poll;
                 }
         } while (rx_status_is_set);

CSR5 bit of interest is only the rx status.

Look at the last if statement: you have just finished grabbing all the packets from the rx ring, and you check whether the status bit says more packets have just come in. It says none, so you enable rx interrupts again. If a new packet came in during this check, we are counting on CSR5 being set in that small window of opportunity, so that re-enabling interrupts actually triggers an interrupt to register the new packet for processing.
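
The level-triggered behaviour this relies on can be sketched with two simulated registers. This is a model under stated assumptions, not tulip code: struct mock_csr, RX_BIT and the helper functions are hypothetical, with 'status' playing the role of CSR5 and 'enable' the role of CSR7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RX_BIT (1u << 0)   /* hypothetical position of the rx status bit */

/* Simulated registers: 'status' plays the role of CSR5, 'enable' of CSR7. */
struct mock_csr {
    uint32_t status;
    uint32_t enable;
};

/* Level-triggered line: asserted whenever a pending event is also enabled. */
static bool irq_asserted(const struct mock_csr *csr)
{
    return (csr->status & csr->enable) != 0;
}

/* A packet arrival sets the status bit whether or not it is enabled. */
static void packet_arrives(struct mock_csr *csr)
{
    csr->status |= RX_BIT;
}

/* End of poll(): unmask rx interrupts. Returns true if one fires at once. */
static bool unmask_rx(struct mock_csr *csr)
{
    csr->enable |= RX_BIT;
    return irq_asserted(csr);
}

/* Walks through the narrow window; returns true if the late packet is noticed. */
static bool late_packet_is_caught(void)
{
    struct mock_csr csr = { .status = 0, .enable = 0 };  /* rx masked while polling */

    assert(!irq_asserted(&csr));   /* nothing pending, nothing enabled */
    packet_arrives(&csr);          /* sneaks in after the final status check */
    assert(!irq_asserted(&csr));   /* still masked, so no interrupt yet */
    return unmask_rx(&csr);        /* unmasking raises the interrupt immediately */
}
```

In this model late_packet_is_caught() returns true because the status bit stays set while masked, so the packet does not rot.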


Non-level sensitive IRQs

Some systems have hardware that does not do level triggered IRQs properly. On such hardware, IRQs may be lost while they are masked, and the only safe way to leave poll is to double check for new input after netif_rx_complete() is invoked, re-enabling polling if new input is seen.

 	.
 	. 
 restart_poll:
 	while (ring_is_not_empty()) {
 		work-work-work
 		if quota is exceeded: exit, not touching irq status/mask
 	}
 	.
 	.
 	.
 	enable_rx_interrupts()
 	netif_rx_complete(dev);
 	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
 		disable_rx_and_rxnobufs()
 		goto restart_poll
 	}


Basically, netif_rx_complete() removes us from the poll list. Because of the race, however, a new packet might have come in that would never be caught, so we attempt to re-add ourselves to the poll list.
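
The double check can be sketched as a userspace model of hardware that loses interrupts while they are masked. Everything here is hypothetical (struct mock_dev, packet_arrives(), finish_poll()); the comments mark where the pseudo code above calls enable_rx_interrupts(), netif_rx_complete(), netif_rx_reschedule() and disable_rx_and_rxnobufs().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: an arrival while the interrupt is masked raises
 * nothing, not even later when the interrupt is unmasked. */
struct mock_dev {
    int  ring_pending;     /* packets sitting in the rx ring */
    bool rx_irq_enabled;
    bool on_poll_list;
    int  irqs_raised;
};

static void packet_arrives(struct mock_dev *dev)
{
    dev->ring_pending++;
    if (dev->rx_irq_enabled)
        dev->irqs_raised++;    /* only unmasked arrivals interrupt */
}

/* Tail of poll() once the ring looked empty. 'late_arrivals' packets are
 * injected just before interrupts are re-enabled to model the race.
 * Returns true if polling was re-armed. */
static bool finish_poll(struct mock_dev *dev, int late_arrivals)
{
    assert(dev->on_poll_list && !dev->rx_irq_enabled);

    while (late_arrivals-- > 0)
        packet_arrives(dev);          /* lost: interrupt is still masked */

    dev->rx_irq_enabled = true;       /* enable_rx_interrupts() */
    dev->on_poll_list = false;        /* stand-in for netif_rx_complete() */

    if (dev->ring_pending > 0) {      /* the double check */
        dev->on_poll_list = true;     /* stand-in for netif_rx_reschedule() */
        dev->rx_irq_enabled = false;  /* disable_rx_and_rxnobufs() */
        return true;                  /* caller jumps back to restart_poll */
    }
    return false;
}
```

With one late arrival, no interrupt is ever raised in this model, yet finish_poll() returns true and the packet is picked up on the next pass instead of rotting in the ring.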


Scheduling issues

As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the general solution for scheduling softirqs to run before the next interrupt, by putting them under scheduler control. This also prevents consecutive softirqs from monopolizing the CPU. A side effect is that the priority of ksoftirqd needs to be considered when running very CPU-intensive applications alongside networking, in order to get the proper softirq/user balance. Increasing ksoftirqd priority to 0 (eventually more) is reported to cure problems with low network performance at high CPU load.

Most used processes in a GIGE router:

 USER  PID  %CPU %MEM  SIZE   RSS TTY STAT START     TIME COMMAND
 root    3  0.2  0.0     0     0  ?   RWN  Aug 15  602:00 (ksoftirqd_CPU0)
 root  232  0.0  7.9 41400 40884  ?   S    Aug 15   74:12 gated

