[Qemu-devel] How to lock-up your tap-based VM network

Jan Kiszka Mon, 12 Apr 2010 09:45:39 -0700

Hi,

we found an ugly issue of the (pseudo) flow-control mechanism in
tap-based networks:


In recent Linux kernels (>= 2.6.30), the tun driver does TX queue length
accounting and stops sending packets if any local receiver does not
return enough of them. This aims at throttling the TX side when the RX
side is temporarily not able to run (e.g. because of CPU
overcommitment). Before that, there was the risk of dropping packets in
this scenario. Unfortunately this approach is fragile and even
counterproductive in some scenarios.

It is fragile as accounting is done based on skb->truesize on the sender
side while it is purely packet counting on the receiver side.
net/tap-linux.c claims:

> /* sndbuf should be set to a value lower than the tx queue
>  * capacity of any destination network interface.
>  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
>  * a good default, given a 1500 byte MTU.
>  */
> #define TAP_DEFAULT_SNDBUF 1024*1024

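This works for maximum-sized packets, but fails for minimum-sized ones:
far more small packets fit into the same byte budget than a
packet-counting queue of 1000 entries can hold.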
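
To make the mismatch concrete, here is a rough back-of-the-envelope
sketch. The per-packet overhead (ASSUMED_SKB_OVERHEAD) and the helper
name are assumptions for illustration only; the real skb->truesize
depends on the kernel version and allocation sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Value quoted from net/tap-linux.c. */
#define TAP_DEFAULT_SNDBUF (1024 * 1024)

/* ASSUMPTION: fixed bookkeeping overhead that skb->truesize adds on top
 * of the frame bytes. Illustrative only, not taken from any kernel. */
#define ASSUMED_SKB_OVERHEAD 768

/* Hypothetical helper: number of frames that fit into the sender's byte
 * budget before the sender gets throttled. */
static size_t frames_until_throttle(size_t sndbuf, size_t frame_len,
                                    size_t overhead)
{
    size_t truesize = frame_len + overhead;

    assert(truesize > 0);
    return sndbuf / truesize;
}
```

Under these assumed numbers, 1514-byte frames throttle the sender after
459 frames, well below txqueuelen=1000, while 60-byte frames allow 1266
frames in flight, which is more than a 1000-entry queue can take.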

But things get worse: Consider a local bridge with two VMs attached via
taps, and maybe a third interface used to connect to the world. If one
VM decides to shut down its interface, its tap will queue packets
directed to it or sent as multicast to the bridge (500 by default) until
it overruns and finally starts dropping. If most of those packets came
from the other VM, that one will run out of resources before that point!
Simple test: ifdown on the one side, ping -b -s 1472 on the other, and
you will lock out the second VM. This has happened in the field, leaving
an unhappy customer. I see the point in avoiding packet drops, but this
can only work as best effort and must not cause such deadlocks.
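
The lock-up can be illustrated with a toy model. This is only a sketch
of the accounting described above, not the tun driver's code: the
struct, the function name and the truesize values used with it are
assumptions. Every frame parked in the stopped receiver's queue stays
charged to the sender, and only a drop would release that charge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_result {
    size_t queued;        /* frames parked in the stopped receiver's queue */
    size_t dropped;       /* frames dropped after that queue overran */
    bool sender_blocked;  /* true if the sender ran out of sndbuf budget */
};

/* Hypothetical model: the sender tries to send `attempts` frames of the
 * given (assumed) truesize to a receiver that never drains its queue. */
static struct sim_result simulate_stopped_receiver(size_t sndbuf,
                                                   size_t truesize,
                                                   size_t rx_queue_len,
                                                   size_t attempts)
{
    struct sim_result r = { 0, 0, false };
    size_t charged = 0;   /* bytes currently accounted to the sender */

    assert(truesize > 0);
    for (size_t i = 0; i < attempts; i++) {
        if (charged + truesize > sndbuf) {
            /* Budget exhausted before the queue overran: lock-up. */
            r.sender_blocked = true;
            break;
        }
        if (r.queued < rx_queue_len) {
            /* Frame is queued and stays charged to the sender. */
            r.queued++;
            charged += truesize;
        } else {
            /* Queue overran: the frame is dropped, nothing stays charged. */
            r.dropped++;
        }
    }
    return r;
}
```

With a 1 MB sndbuf, a 500-entry queue and an assumed truesize of 2282
bytes for the large ping frames, the model blocks the sender after 459
queued frames, before the first drop could ever relieve it. With an
assumed truesize of 828 bytes, the queue overruns first and the sender
keeps running.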

A major reason for this deadlock could likely be removed by shutting
down the tap (if peered) or dropping packets in user space (in case of
vlan) when a NIC is stopped or otherwise shut down. Currently most (if
not all) NIC models seem to signal both "queue full" and "RX disabled"
via !can_receive(). This should be changed, probably by returning a
reason for "can't receive" so that the network layer can decide what to do.
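
A minimal sketch of what such a reason code might look like. All names
here (rx_state, net_action, net_decide) are hypothetical and not
existing QEMU interfaces; the point is only that "queue full" and "RX
disabled" lead to different decisions in the network layer.

```c
#include <assert.h>

/* Hypothetical replacement for the boolean can_receive() result. */
enum rx_state {
    RX_READY,       /* NIC can take a packet now */
    RX_QUEUE_FULL,  /* temporarily busy, will drain again */
    RX_DISABLED     /* NIC stopped or shut down by the guest */
};

/* Hypothetical decisions of the network layer. */
enum net_action {
    NET_DELIVER,    /* hand the packet to the NIC */
    NET_HOLD,       /* keep it queued, i.e. apply backpressure */
    NET_DROP        /* discard it so the sender's budget is released */
};

static enum net_action net_decide(enum rx_state state)
{
    switch (state) {
    case RX_READY:
        return NET_DELIVER;
    case RX_QUEUE_FULL:
        return NET_HOLD;
    case RX_DISABLED:
        return NET_DROP;
    }
    assert(!"unknown rx_state");
    return NET_DROP;
}
```

Only RX_QUEUE_FULL keeps the flow control in place; RX_DISABLED drops
right away, so a stopped NIC could no longer pin the other VM's send
budget.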

Opinions? Better suggestions?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
