Another issue I found during my tests over the last few days is a problem
with BQL, and more generally whenever a driver stops/starts the queue.
Under stress, when BQL stops the queue, the driver TX completion does a
lot of work, and the servicing CPU also takes over the subsequent qdisc_run().
The workflow is:
1) Collect up to 64 packets (256 for mlx4) from the TX ring, unmap the
   buffers, and queue the skbs for freeing.
2) Call netdev_tx_completed_queue(ring->tx_queue, packets, bytes), which does:

	if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
		netif_schedule_queue(dev_queue);

This leaves only a very tiny window in which other cpus could grab
__QDISC_STATE_SCHED (in practice they have absolutely no chance to grab it).
So we end up with one cpu doing the ndo_start_xmit() calls, the TX
completions, and the RX work.
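
To make the sequence concrete, here is a small user-space model of it. This is only a sketch, not kernel code: every model_* name is hypothetical, the bit numbers are local to the model rather than the kernel's values, and two plain words stand in for dev_queue->state and the qdisc state. It assumes the scheduling step boils down to a test-and-set on the SCHED bit, with the winner running the qdisc on its own cpu.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers are local to this model, not the kernel's values. */
#define MODEL_STACK_XOFF 0	/* stands in for __QUEUE_STATE_STACK_XOFF */
#define MODEL_SCHED      0	/* stands in for __QDISC_STATE_SCHED */

struct model_txq {
	atomic_ulong queue_state;	/* models dev_queue->state */
	atomic_ulong qdisc_state;	/* models the qdisc state word */
	int sched_cpu;			/* cpu that won SCHED, -1 if none */
};

/* Atomically set a bit and return its previous value. */
static bool model_test_and_set_bit(int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Atomically clear a bit and return its previous value. */
static bool model_test_and_clear_bit(int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_and(word, ~mask) & mask) != 0;
}

/* Models the scheduling step: only the first caller wins SCHED. */
static bool model_schedule(struct model_txq *q, int cpu)
{
	if (model_test_and_set_bit(MODEL_SCHED, &q->qdisc_state))
		return false;	/* another cpu already owns the bit */
	q->sched_cpu = cpu;	/* the qdisc will be serviced on this cpu */
	return true;
}

/* Models the tail of TX completion on the cpu servicing the interrupt. */
static bool model_tx_completed(struct model_txq *q, int cpu)
{
	if (model_test_and_clear_bit(MODEL_STACK_XOFF, &q->queue_state))
		return model_schedule(q, cpu);
	return false;
}

/* Replays the sequence: cpu 0 completes TX, then cpu 1 tries to schedule. */
int model_demo(void)
{
	struct model_txq q;
	bool cpu0_won, cpu1_won;

	atomic_init(&q.queue_state, 1UL << MODEL_STACK_XOFF); /* BQL stopped the queue */
	atomic_init(&q.qdisc_state, 0UL);
	q.sched_cpu = -1;

	cpu0_won = model_tx_completed(&q, 0);	/* clears XOFF, takes SCHED at once */
	cpu1_won = model_schedule(&q, 1);	/* arrives after the bit is set */
	assert(cpu0_won && !cpu1_won);

	return q.sched_cpu;
}
```

In the model, the cpu that clears the XOFF bit is also the one that sets SCHED in the very next step, so any other cpu trying afterwards finds the bit already taken. That is the tiny window described above.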
This problem is magnified when XPS is used, if one mono-threaded application
deals with thousands of TCP sockets.
We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some other
way to allow another cpu to service the qdisc and spare us.