On 5/1/19 10:03 AM, Alan Maguire wrote: > I'm seeing the following repeated error > > [ 130.821362] neighbour: arp_cache: neighbor table overflow! > > when using MPLSoverGRE or MPLSoverUDP tunnels on bits synced > with bpf-next as of this morning. The test script below reliably > reproduces the problem, while working fine on a 4.14 (I haven't > bisected yet). It can be run with no arguments, or specifying > gre or udp for the specific encap type. > > It seems that every MPLS-encapsulated outbound packet is attempting > to add a neighbor entry, and as a result we hit the > net.ipv4.neigh.default.gc_thresh3 limit quickly. > > When this failure occurs, the arp table doesn't show any of > these additional entries. Existing arp table entries are > disappearing too, so perhaps they are being recycled when the > table becomes full? >
There are 2 bugs: 1. neigh_xmit fails to find a neighbor entry on every single Tx. This was introduced by: cd9ff4de010 ("ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY") Basically, the primary_key is reset to 0 for tun's but the neigh_xmit lookup was not corrected. That caused a new neigh entry to be created on each packet Tx, but before inserting the new one to the table the create function looks to see if an entry already exists. The arp constructor had reset the key to 0 in the new neighbor entry so the second lookup finds a match and the new one is dropped. That exposed a second bug. 2. neigh_alloc bumps the gc_entries counter when a new one is allocated, but ___neigh_create is not dropping the counter in the error path. Ian reported a similar problem, but we were not able to isolate the cause. Thanks for the script - very helpful in resolving the bugs. I made some changes to it and I plan to submit it to selftests as a starter for mpls tests. Bug fix patches coming.