
Multiqueue networking

By Jonathan Corbet
July 8, 2008
One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device. The core networking code will call a driver's hard_start_xmit() function to let the driver know that a packet is ready for transmission; it is then the driver's job to feed that packet into the hardware's transmit queue. The result is a data structure which looks vaguely like this:

[Network transmit queue]

"Vaguely" because the list of sk_buff structures (SKBs - the internal representation of packets) does not exist in this form within the kernel; instead, the driver maintains the queue in a way that the hardware can process it.

This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions (WME), for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. On the other hand, the queues for this kind of traffic may be relatively short; if a video packet doesn't get sent on its way quickly, the receiving end will lose interest and move on. So it might be better to just drop video packets which have been delayed for too long.

At the other end of the scale, traffic at the "background" level only gets transmitted if there is nothing else to do; that level is well suited to low-priority traffic like bittorrent or email from the boss. It would make sense to have a relatively long queue for background packets, though, to be able to take full advantage of a lull in higher-priority traffic.
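
In C terms, the four classes might be represented by something like the following; the names here are illustrative, since drivers and the 802.11 code define their own:

    enum wme_access_category {
        WME_AC_VOICE,        /* highest priority, short queue */
        WME_AC_VIDEO,
        WME_AC_BEST_EFFORT,  /* ordinary traffic */
        WME_AC_BACKGROUND,   /* sent only when nothing else waits */
    };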

Within these devices, each class of service has its own transmit queue. This separation of traffic makes it easy for the hardware to choose which packet to transmit next. It also allows independent limits on the size of each queue; there is no point in filling the device's queue space with background traffic which is not going to be transmitted anytime soon. But the networking subsystem does not have any built-in support for multiqueue devices. This hardware has been driven using a number of creative techniques which have gotten the job done, but not in an optimal way. That may be about to change, though, with the advent of David Miller's multiqueue transmit patch series.

The current code treats a network device as the fundamental unit which is managed by the outgoing packet scheduler. David's patches change that idea somewhat, since each transmit queue will need to be scheduled independently. So there is a new netdev_queue structure which encapsulates all of the information about a single transmit queue, and which is protected by its own lock. Multiqueue drivers then set up an array of these structures. So the new data structure can, with sufficient imagination, be seen to look something like this:

[Multiqueue tx structure]

Once again, the actual lists of outgoing packets normally exist in the form of special data structures in device-accessible memory. Once the device has these queues set up for it, the various policies associated with each class of service can be implemented. Each queue is managed independently, so more voice packets can be queued even if some other queue (background, say) is overflowing.
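
An abbreviated view of the new structure (simplified here; the real definition carries additional fields) looks something like:

    struct netdev_queue {
        struct net_device *dev;    /* the owning device */
        spinlock_t _xmit_lock;     /* per-queue transmit lock */
        int xmit_lock_owner;
        struct Qdisc *qdisc;       /* packet scheduler for this queue */
        unsigned long state;       /* queue state (stopped, etc.) */
        /* ... */
    };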

David would appear to have worked hard to avoid creating trouble for network driver developers. Drivers for single-queue devices need not be changed at all, and the addition of multiqueue support is relatively straightforward. The first step is to replace the alloc_etherdev() call with a call to:

    struct net_device *alloc_etherdev_mq(int sizeof_priv, 
                                         unsigned int queue_count);

The new queue_count parameter describes the maximum number of transmit queues that the device might support. The actual number in use should be stored in the real_num_tx_queues field of the net_device structure. Note that this value can only be changed when the device is down.
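
So a driver for hardware which could have up to four transmit rings, but which only uses two of them, might do something like this in its probe routine (the mydev names are, once again, illustrative):

    #define MYDEV_MAX_TX_QUEUES 4

    dev = alloc_etherdev_mq(sizeof(struct mydev_priv),
                            MYDEV_MAX_TX_QUEUES);
    if (!dev)
        return -ENOMEM;

    /* This particular device only has two usable rings; this
       assignment must happen while the device is down. */
    dev->real_num_tx_queues = 2;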

A multiqueue driver will get packets destined for any queue via the usual hard_start_xmit() function. To determine which queue to use, the driver should call:

    u16 skb_get_queue_mapping(struct sk_buff *skb);

The return value is an index into the array of transmit queues. One might well wonder how the networking core decides which queue to use in the first place. That is handled via a new net_device callback:

    u16 (*select_queue)(struct net_device *dev, struct sk_buff *skb);

The patch set includes an implementation of select_queue() which can be used with WME-capable devices.
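
For other kinds of hardware, a driver might supply something simpler of its own. The sketch below (hypothetical, not from the patch set) maps the packet's priority onto one of the available queues, then uses the stored mapping in the transmit path:

    static u16 mydev_select_queue(struct net_device *dev,
                                  struct sk_buff *skb)
    {
        /* Fold the socket priority into the range of real queues. */
        return skb->priority % dev->real_num_tx_queues;
    }

    static int mydev_hard_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
    {
        struct mydev_priv *priv = netdev_priv(dev);
        u16 ring = skb_get_queue_mapping(skb);

        /* Feed the packet to the hardware ring chosen above. */
        mydev_ring_add(priv, ring, skb);
        return NETDEV_TX_OK;
    }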

About the only other required change is for multiqueue drivers to inform the networking core about the status of specific queues. To that end, there is a new set of functions:

    struct netdev_queue *netdev_get_tx_queue(struct net_device *dev,
                                             u16 index);

    void netif_tx_start_queue(struct netdev_queue *dev_queue);
    void netif_tx_wake_queue(struct netdev_queue *dev_queue);
    void netif_tx_stop_queue(struct netdev_queue *dev_queue);

A call to netdev_get_tx_queue() will turn a queue index into the struct netdev_queue pointer required by the other functions, which can be used to stop and start the queue in the usual manner. Should the driver need to operate on all of the queues at once, there is a set of helper functions:

    void netif_tx_start_all_queues(struct net_device *dev);
    void netif_tx_wake_all_queues(struct net_device *dev);
    void netif_tx_stop_all_queues(struct net_device *dev);
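
Putting the pieces together, a multiqueue driver's transmit and completion paths might manage flow control for a single ring along these lines (a sketch only; the mydev helpers are invented):

    /* In the transmit path: stop just this queue when its
       hardware ring fills up. */
    if (mydev_ring_full(priv, ring))
        netif_tx_stop_queue(netdev_get_tx_queue(dev, ring));

    /* In the TX-completion handler: wake the queue once the
       hardware has freed some descriptors. */
    if (mydev_ring_has_room(priv, ring))
        netif_tx_wake_queue(netdev_get_tx_queue(dev, ring));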

Naturally, there are a few other details to deal with, and the multiqueue interface is likely to evolve somewhat over time. At one point, David was hoping to have this feature ready for inclusion into 2.6.27, but that goal looks overly ambitious now. It does seem that much of the groundwork will be merged in the next development cycle, though, meaning that full multiqueue support should be in good shape for merging in 2.6.28.

