On Thu, Jun 16, 2016 at 08:38:49AM -0700, Dave Taht wrote:
> On Thu, Jun 16, 2016 at 4:17 AM, Kirill Smelkov <[email protected]> wrote:
> > On Wed, Jun 15, 2016 at 12:56:34PM +0200, Juliusz Chroboczek wrote:
> >> >> If I read you correctly, this looks like a kernel bug: incorrect
> >> >> invalidation of the route cache.
> >>
> >> [...]
> >>
> >> > What we have here is of another kind - it is inherent race condition
> >> > inside kernel
> >>
> >> Perhaps I'm confused, but it still looks like a kernel bug to me.
> >
> > Yes, it is a kernel bug. But in a sense it is so old and so widespread
> > that it has to be cared about in userspace - as with atomic route
> > updates we do not hit it.
> >
> > Also: atomic route updates are needed not only for avoiding this bug.
> > Another reason is: if we have routedel & routeadd pair, even after
> > routeadd the state of cache is correct, in the time between del & add,
> > if a packet destined to that route gets to the node, it hits
> > 'unreachable' route case.
> >
> > For usual packets it is only "packet lost" and TCP probably retransmits.
> > But for SYN packets, e.g. when a connection is going to be established,
> > ICMP error is returned which results in "host unreachable" error on
> > originator side.
>
> Yes this variant of the bug is still there, essentially, and it bugs me.
>
> (btw the facebook page you pointed to fixes they did was fascinating -
> they have "interesting problems" - like dealing with 1+m routes in
> their route table)
>
> one day a year, for several years now, I get sufficiently irked about
> the atomic update problem in babel to refresh my knowledge of netlink,
> hack babel all to hell, and have nothing work. I left myself a bunch
> more breadcrumbs last night in my hacked up babel version, as to what
> I tried and what it did wrong... (because I'm actually also chasing
> another bug which I'll put up in another message)....
>
> But:
>
> Why doing the equivalent of this (and understanding how it does it)
>
> ip -6 route add fd99::33/128 via fe80::120d:7fff:fe64:c992 dev eno1
> ip -6 route replace fd99::33/128 via fe80::120d:7fff:fe64:c991 dev wlp2s0
>
> is so hard for me to figure out - that I don't understand. But it
> seems to require completely tracing through the ip route code, and
> writing a decoder for the netlink packets created, to figure out why
> what I thought would be an equivalent for babel, and taking the week
> or more to do it...
>
> -- look! Squirrel!

Dave, maybe this might help you:

Wireshark (not tcpdump) has a decoder for netlink route packets:

    https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-netlink-route.c;hb=v2.1.1rc0-170-gc269684

so you can create a virtual netlink monitor interface - something along
the lines of

    modprobe nlmon
    ip link add type nlmon
    ip link set nlmon0 up

( see more details in e.g. https://patchwork.ozlabs.org/patch/259444/ )

and see the actual packets exchanged between iproute and kernel.

Also: there is pyroute2 (https://github.com/svinota/pyroute2) which has
a debug decoder for netlink packets, but out of the box you have to
specify the packet type explicitly:

    https://github.com/svinota/pyroute2/blob/master/docs/debug.rst

Maybe you already know all this, but I decided to provide the info anyway
to make sure it is not missed, because you mentioned it is hard for you
to understand what is going on underneath `ip -6 ...`

Hope this might help,
Kirill

> >> Perhaps it would make sense to speak to netdev about that?
> >
> > Yes, makes sense. Though as this particular case is not present on 4.2+
> > kernels, people on netdev will probably has less interest to look into.
> >
> > I will see what can be done.
> >
> >> > Quagga, at least, switched to atomic updates some time ago, I think.
> >> >
> >> > http://patchwork.quagga.net/patch/1234/
> >>
> >> I see. I'm busy right now, but I'll be grateful for a patch.
> >
> > I see about this. Thanks for feedback.
> >
> >
> > On Wed, Jun 15, 2016 at 07:35:05PM -0700, Dave Taht wrote:
> >> > https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture
> >> > (also attached to this email)
> >> >
> >> > which reproduces the problem in several minutes just on one computer and
> >> > retested it locally: I can reliably reproduce the issue on pristine
> >> > Debian 3.16.7-ckt25-2 (on both Atom and Core2 notebooks) and on pristine
> >> > 3.16.35 on Atom (compiled by me, since Debian kernel team has not yet
> >> > uploaded 3.16.35 to Jessie).
> >>
> >> I have been running this script on four different machines for hours
> >> now without reproducing your bug on the 4.4 or later kernels. It does
> >> trigger on a 3.14 kernel. (it helps to do a killall fping6 before
> >> exiting!)
> >>
> >> It does not seem to be happening on 4.4 or later. At one level, I'm
> >> relieved - one last babel bug to worry about in openwrt (now 4.4
> >> based), although one of the platforms I work on is still stuck at
> >> 3.18, as is the 3.14 c2 (for now).
> >>
> >> At another level I still really, really, really wanted atomic updates
> >> in babel, and was clearing the decks to make a run at the right
> >> netlink stuff when I'd decided to confirm your bug existed or not in
> >> my kernels. :(. Weirdly demotivating.
> >>
> >>
> >> d@dancer:~/bin$ ssh root@pi3 uname -a
> >> Linux pi3 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
> >> d@dancer:~/bin$ ssh root@pi2 uname -a
> >> Linux pi2 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
> >> d@dancer:~/bin$ uname -a
> >> Linux dancer 4.5.0-rc7-fqfi #1 SMP PREEMPT Mon Mar 7 16:04:17 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> ...
> >>
> >> The odroid C2 has the bug.
> >>
> >> d@dancer:~/bin$ ssh root@c2 uname -a
> >> Linux c2 3.14.29-56 #1 SMP PREEMPT Wed Apr 20 12:15:54 BRT 2016 aarch64 aarch64 aarch64 GNU/Linux
> >>
> >> BUG: Got unexpected unreachable route for 2226:3333:4444:5555::1: # I'd changed the number
> >> unreachable 2226:3333:4444:5555::1 from :: dev lo src fd99::2 metric 0 \ cache error -101
> >>
> >> route table for root 2226:3333:4444::/48
> >> ---- 8< ----
> >> unicast 2226:3333:4444:5555::/64 dev dum0 proto boot scope global metric 1024
> >> unreachable 2226:3333:4444::/48 dev lo proto boot scope global metric 1024 error -101
> >> ---- 8< ----
> >>
> >> route for 2226:3333:4444:5555::1 (once again)
> >> unreachable 2226:3333:4444:5555::1 from :: dev lo src fd99::2 metric 0 \ cache error -101 users 1 used 3
> >
> > Dave, thanks for confirming and for feedback about this.
> >
> > Yes, 4.2+ kernels should not have this _particular_ bug, because
> > https://git.kernel.org/linus/45e4fd26 reworks ip6_pol_route() for above
> > tested case to not lock the route table twice and not to create /128
> > cache entries on lookup when there is a gateway.
> >
> > BUT
> >
> > Route cache for IPv6 is still there in new kernels, and sometimes cache
> > entries are created. E.g. this happens on PMTU exception, but also for
> > lookups without gateway when associated flow has FLOWI_FLAG_KNOWN_NH set
> > (I don't yet know what it is yet, but still):
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/ipv6/route.c?id=v4.7-rc3-55-gd325ea8#n1089
> >
> > etc.
> >
> > So _related_ problems should be there. They are probably just maybe less
> > easily reproducible and less often happening. I have not looked into
> > further details though...
> >
> > And also: as shown above it is better to have atomic route updates even
> > without cache issues to get SYN not occasionally rejected in the time of
> > route update.
> >
> > So Dave, please keep up your motivation for fixing this if you were
> > going to eventually do so.
> >
> > Thanks,
> > Kirill
> >
> > P.S.
> >
> >> (it helps to do a killall fping6 before exiting!)
> >
> > There is
> >
> > trap 'kill $(jobs -p)' EXIT
> >
> > it does not work?
> >
> >
> >> > It is always the same: the issue reproduces reliably in several minutes.
> >> > And it looks like e.g.
> >> >
> >> > ----- 8< ----
> >> > root@mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture
> >> > PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
> >> > E.E.E.....E......E..E............E...E..
> >> > <more output from ping>
> >> >
> >> > BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
> >> > BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
> >> > unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \ cache error -101
> >> >
> >> > route table for root 2222:3333:4444::/48
> >> > ---- 8< ----
> >> > unicast 2222:3333:4444:5555::/64 dev dum0 proto boot scope global metric 1024
> >> > unreachable 2222:3333:4444::/48 dev lo proto boot scope global metric 1024 error -101
> >> > ---- 8< ----
> >> >
> >> > route for 2222:3333:4444:5555::1 (once again)
> >> > unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \ cache error -101 users 1 used 4
> >> >
> >> > real 0m49.938s
> >> > user 0m4.488s
> >> > sys 0m5.872s
> >> > ---- 8< ----
> >> >
> >> > The issue should not show itself with kernels >= 4.2, because there the
> >> > lookup procedure does not take table lock twice, and /128 cache entries
> >> > are not routinely created (they are created only upon PMTU exception).
> >> >
> >> > I'm running Debian testing on my development machine. Currently it has
> >> > 4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are
> >> > not created there just because a route was looked up.
> >> >
> >> > Kirill
> >> >
> >> >
> >> > ---- 8< ---- (rtcache-torture)
> >> > #!/bin/sh -e
> >> > # torture for IPv6 RT cache, trying to hit the race between lookup,cache-add & route add
> >> > # http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html
> >> >
> >> >
> >> > tprefix=2222:3333:4444 # "whole-network" prefix for tests /48
> >> > tsubnet=$tprefix:5555 # subnetwork for which "to" route will be changed /64
> >> > taddr=$tsubnet::1 # test address on $tsubnet
> >> >
> >> > # setup for tests:
> >> >
> >> > # dum0 dummy device
> >> > ip link del dev dum0 2>/dev/null || :
> >> > ip link add dum0 type dummy
> >> > ip link set up dev dum0
> >> >
> >> > # clean route table for tprefix with only unreachable whole-network route
> >> > ip -6 route flush root $tprefix::/48
> >> > ip -6 route add unreachable $tprefix::/48
> >> > ip -6 route flush cache
> >> >
> >> > ip -6 route add $tsubnet::/64 dev dum0
> >> >
> >> >
> >> > # put a lot of requests to rt/rtcache getting route to $taddr
> >> > trap 'kill $(jobs -p)' EXIT
> >> > rtgetter() {
> >> >     # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
> >> >     # get` first takes RTNL lock, and thus will be completely serialized with
> >> >     # e.g. route add and del.
> >> >     #
> >> >     # Ping, like other usually connect/tx activity works without RTNL held.
> >> >     exec ping6 -n -f $taddr
> >> > }
> >> > rtgetter &
> >> >
> >> > # do route del/route in busyloop;
> >> > # after route add: check route get $addr is not unreachable
> >> > while true; do
> >> >     ip -6 route del $tsubnet::/64 dev dum0
> >> >     ip -6 route add $tsubnet::/64 dev dum0
> >> >     r=`ip -6 -d -o route get $taddr`
> >> >     if echo "$r" | grep -q unreachable ; then
> >> >         echo
> >> >         echo
> >> >         echo BUG: `uname -a`
> >> >         echo BUG: Got unexpected unreachable route for $taddr:
> >> >         echo "$r"
> >> >         echo
> >> >         echo "route table for root $tprefix::/48"
> >> >         echo "---- 8< ----"
> >> >         ip -6 -d -o route show root $tprefix::/48
> >> >         echo "---- 8< ----"
> >> >         echo
> >> >         echo "route for $taddr (once again)"
> >> >         ip -6 -d -o -s -s -s route get $taddr
> >> >         exit 1
> >> >     fi
> >> > done
_______________________________________________
Babel-users mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/babel-users
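
On the atomic-update point made earlier in this thread, here is a minimal shell sketch contrasting the del/add pair used in the rtcache-torture busy loop with a single `ip -6 route replace`. The helper names and the overridable `IP` variable are assumptions made for illustration only (they let the commands be dry-run with `IP=echo`, without root); nothing here is taken from babel or from the script itself.

```sh
#!/bin/sh
# Hypothetical helpers, not part of rtcache-torture or babel.
# IP is overridable so the commands can be printed instead of run (IP=echo).
IP=${IP:-ip}

# Non-atomic update: between the del and the add, lookups for the prefix
# fall through to whatever covers it (in the torture setup, the
# unreachable /48).
route_switch_del_add() {
    prefix=$1
    dev=$2
    $IP -6 route del "$prefix"
    $IP -6 route add "$prefix" dev "$dev"
}

# Atomic update: one request that creates the route or replaces the
# existing one, intended to leave no window without a route.
route_switch_atomic() {
    prefix=$1
    dev=$2
    $IP -6 route replace "$prefix" dev "$dev"
}
```

With `IP=echo`, calling `route_switch_atomic 2222:3333:4444:5555::/64 dum0` only prints the arguments it would pass to `ip`. Run for real (as root), the two helpers could stand in for the del/add lines of the busy loop to compare behaviour on an affected kernel.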

