Hi,

we found an ugly issue with the (pseudo) flow-control mechanism in tap-based networks:

In recent Linux kernels (>= 2.6.30), the tun driver does TX queue length accounting and stops sending packets if any local receiver does not return enough of them. This aims at throttling the TX side when the RX side is temporarily not able to run (e.g. because of CPU overcommitment). Before that, there was the risk of dropping packets in this scenario.

Unfortunately, this approach is fragile and even counterproductive in some scenarios. It is fragile because accounting is based on skb->truesize on the sender side, while it is purely packet counting on the receiver side. net/tap-linux.c claims:

> /* sndbuf should be set to a value lower than the tx queue
>  * capacity of any destination network interface.
>  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
>  * a good default, given a 1500 byte MTU.
>  */
> #define TAP_DEFAULT_SNDBUF 1024*1024

This works for maximum-sized packets but fails for minimum-sized ones, because the same byte budget then admits more packets than a packet-counting queue can hold.

But things get worse: Consider a local bridge with two VMs attached via taps, and maybe a third interface used to connect to the world. If one VM decides to shut down its interface, packets directed to it or sent as multicast to the bridge will be queued (500 by default) until the queue overruns and finally starts dropping. If most of those packets came from the other VM, that one will run out of resources before that point!

Simple test: ifdown on the one side, ping -b -s 1472 on the other, and you will lock out the second VM. This has happened in the field, creating an unhappy customer.

I see the point in avoiding packet drops, but this can only work as best effort and must not cause such deadlocks. A major reason for this deadlock could likely be removed by shutting down the tap (if peered) or dropping packets in user space (in case of vlan) when a NIC is stopped or otherwise shut down. Currently most (if not all) NIC models seem to signal both "queue full" and "RX disabled" via !can_receive().

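As a side note on the accounting mismatch mentioned above (bytes of skb->truesize on the sender side, packet counting on the receiver side), the arithmetic can be sketched as follows. This is only an illustration under stated assumptions: ASSUMED_SKB_OVERHEAD is a hypothetical per-packet overhead, not a value taken from any kernel, and both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-packet overhead added to the frame length to stand in
 * for skb->truesize. The real value depends on kernel version and
 * allocator; this number is an assumption for illustration only. */
#define ASSUMED_SKB_OVERHEAD 256u

/* Same value as the TAP_DEFAULT_SNDBUF quoted from net/tap-linux.c. */
#define SNDBUF_BYTES (1024u * 1024u)

/* Number of frames of frame_len bytes a sender may have outstanding
 * before a byte budget of sndbuf bytes is used up. */
size_t frames_until_sndbuf_full(size_t sndbuf, size_t frame_len,
                                size_t overhead)
{
    size_t charged = frame_len + overhead; /* stand-in for truesize */

    assert(charged > 0);
    return sndbuf / charged;
}

/* Nonzero if the byte budget throttles the sender before a receiver
 * queue holding queue_len packets would overflow. */
int sender_throttled_first(size_t sndbuf, size_t frame_len,
                           size_t overhead, size_t queue_len)
{
    return frames_until_sndbuf_full(sndbuf, frame_len, overhead) < queue_len;
}
```

With these assumed numbers, full-sized 1514-byte frames allow 592 outstanding frames, which stays below txqueuelen=1000, while minimum-sized 60-byte frames allow 3318, which is far above it.
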
The !can_receive() signalling should be changed, probably by returning a reason for "can't receive" so that the network layer can decide what to do.

Opinions? Better suggestions?

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
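
For discussion, a minimal sketch of what "returning a reason for can't receive" could look like. Every name below (rx_status, rx_action, net_layer_decide) is hypothetical and does not exist in the code discussed above; the mapping of reasons to actions is just one possible policy, not a proposal for the final interface.

```c
#include <assert.h>

/* Hypothetical reason codes a NIC model could report instead of a
 * plain yes/no answer. */
enum rx_status {
    RX_READY,       /* NIC can take a packet now */
    RX_QUEUE_FULL,  /* temporarily out of RX buffers, retry later */
    RX_DISABLED     /* NIC stopped or otherwise shut down */
};

/* What the network layer does with a packet for that NIC. */
enum rx_action {
    RX_ACTION_DELIVER,
    RX_ACTION_QUEUE,
    RX_ACTION_DROP
};

/* One possible policy: keep throttling for a busy receiver, but drop
 * for a disabled one so that the sender is never blocked by it. */
enum rx_action net_layer_decide(enum rx_status status)
{
    switch (status) {
    case RX_READY:
        return RX_ACTION_DELIVER;
    case RX_QUEUE_FULL:
        return RX_ACTION_QUEUE;
    case RX_DISABLED:
        return RX_ACTION_DROP;
    }
    assert(!"unknown rx_status");
    return RX_ACTION_DROP;
}
```

The point of the split is that only RX_QUEUE_FULL keeps packets accounted against the sender; a stopped NIC no longer looks like a slow one.
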