( +iv, Nicolas's address corrected )

Dear Juliusz, Dave, thanks for your replies.
First of all I'd like to say I'm new to routing & friends, but I'll try to
provide feedback:

On Fri, Jun 10, 2016 at 08:47:34PM +0200, Juliusz Chroboczek wrote:

> Dear Kirill,
>
> Thank you very much for the detailed analysis.

You are welcome.

> If I read you correctly, this looks like a kernel bug: incorrect
> invalidation of the route cache. While we have seen some similar bugs in
> earlier kernel versions, they were not triggered by something that
> simple -- you needed to do some non-trivial rule manipulation in order to
> trigger them.

Initially I too thought this was incorrect invalidation of the kernel route
cache - i.e. that some cloned routes were created, and on new route addition
the route-add procedure somehow logically missed a clone, e.g. because it
was in some other subtree or something like that.

What we have here is of another kind - an inherent race condition inside the
kernel: after a route lookup, a cloned route is born while the table lock is
not held, and then the kernel tries to insert the clone into the FIB. Yes,
there is a check: if some other clone with the same /128 address is already
in the FIB (potentially added in-between, while the table lock was not
held), the whole lookup is retried. But if there is no other same-address
/128 clone, that does not mean the corresponding real route could not have
changed while the table lock was not held.

( To me it looks computationally expensive, at least with a straightforward
implementation, to check whether a newly installed cloned route should be
invalidated by any other route installation - the table has to be scanned
for other routes that would match. Imho the best way to deal with this is
not to have a route cache at all - like Linux already does for IPv4, and
like it is now in 95% of the cases with Facebook patches for IPv6 (>= 4.2
kernel). )

> What is more -- I believe that babeld is using the same procedure as
> Quagga and Bird. Do you understand why Quagga and Bird are not seeing the
> same issues ?
On Sat, Jun 11, 2016 at 11:26:48AM -0700, Dave Taht also wrote:

> Quagga, at least, switched to atomic updates some time ago, I think.
>
> http://patchwork.quagga.net/patch/1234/

First of all I tend to think that in Re6stnet links change more frequently
than in usual conditions. The probability of hitting the race is higher with
a high rate of route changes and high traffic. I cannot say we have really
high traffic on lab.nexedi.com, but the site is constantly being pulled by
our bots requesting raw blob contents from repositories, so let's say we
have 15-30-50 requests/second all the time as a background, plus traffic
when humans use the site.

Then, if there were no network-wide unreachable route (unreachable
2001:67c:1254::/48 in my original mail), an unreachable cache entry would
_not_ be created, as cache entries are created only if a route lookup finds
some entry in the FIB, not upon "entry not found". I tend to think many
setups maybe do not have a network-wide unreachable route, but I'm not sure
about this.

Regarding Quagga and Bird: I have not used them at all, but after a quick
glance I can see:

Quagga (like Dave already said) uses atomic route updates starting from
2016:

    http://git.savannah.gnu.org/cgit/quagga.git/tree/zebra/rt_netlink.c?h=quagga-1.0.20160315-12-g5f67888#n1688
    http://git.savannah.gnu.org/cgit/quagga.git/tree/zebra/rt_netlink.c?h=quagga-1.0.20160315-12-g5f67888#n1870
    http://git.savannah.gnu.org/cgit/quagga.git/commit/?id=0abf6796

Regarding Bird: it used to use NLM_F_REPLACE starting from long ago

    https://gitlab.labs.nic.cz/labs/bird/commit/2253c9e2

but stopped doing so in 2009:

    https://gitlab.labs.nic.cz/labs/bird/commit/51f4469f

I have not looked into details of how NLM_F_REPLACE works (yet ?), but
regarding Bird maybe this email might clarify a bit:

    http://bird.network.cz/pipermail/bird-users/2015-August/009854.html

Once again I do not yet know how NLM_F_REPLACE works, but I hope it can be
clarified and used correctly.
> While I have no objection to switching to a different API for manipulating
> routes, I'd like to first make sure that we understand what's going on
> here.

On Sat, Jun 11, 2016 at 11:26:48AM -0700, Dave Taht also wrote:

> I strongly approve of atomic updates and fixing what, if anything,
> that breaks...
>
> I have seen oddities in unreachable p2p routes for years now. I've
> suspected a variety of causes - notably getting a icmp route
> unreachable before babel could make the switch, but have never tracked
> it down. Some of the work I'm doing now could be leveraged to try and
> make it happen more often, but a few more pieces on top of this
>
> https://www.mail-archive.com/[email protected]/msg114172.html
>
> need to land before I can propagate all the right pieces to the testbed.

Regarding making sure we understand what is going on here: yes. And I think
I've described it quite precisely - there is a race between IPv6 route
lookups and route changes: a cloned route can be created from a route table
state which existed some time ago and may be different by now. I've tried to
show it precisely in the timing diagram for two threads doing route change
and route lookup in my original email. Please also see below for a program
which demonstrates this bug reliably with just one local host.

> Oh -- and are you running a stock kernel, or one locally patched? Can you
> reproduce the issue on a pristine, recent kernel?

We are running pristine latest Debian stable kernels on production. In
particular the issue shows itself with e.g. 3.16.7-ckt25-2 (2016-04-08).

I've run a locally patched kernel only on my notebook, on which I've tried
to understand the issue more with tracing.
I've prepared a program

    https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture

(also attached to this email) which reproduces the problem in several
minutes on just one computer, and retested it locally: I can reliably
reproduce the issue on pristine Debian 3.16.7-ckt25-2 (on both Atom and
Core2 notebooks) and on pristine 3.16.35 on Atom (compiled by me, since the
Debian kernel team has not yet uploaded 3.16.35 to Jessie).

It is always the same: the issue reproduces reliably in several minutes.
And it looks like e.g.

----- 8< ----
root@mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture
PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
E.E.E.....E......E..E............E...E..
<more output from ping>

BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \
    cache  error -101

route table for root 2222:3333:4444::/48
---- 8< ----
unicast 2222:3333:4444:5555::/64 dev dum0 proto boot scope global metric 1024
unreachable 2222:3333:4444::/48 dev lo proto boot scope global metric 1024 error -101
---- 8< ----

route for 2222:3333:4444:5555::1 (once again)
unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \
    cache  error -101 users 1 used 4

real    0m49.938s
user    0m4.488s
sys     0m5.872s
---- 8< ----

The issue should not show itself with kernels >= 4.2, because there the
lookup procedure does not take the table lock twice, and /128 cache entries
are not routinely created (they are created only upon a PMTU exception).

I'm running Debian testing on my development machine. Currently it has
4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are not
created there just because a route was looked up.
Kirill

---- 8< ---- (rtcache-torture)
#!/bin/sh -e
# torture for IPv6 RT cache, trying to hit the race between lookup,cache-add & route add
# http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html

tprefix=2222:3333:4444      # "whole-network" prefix for tests                      /48
tsubnet=$tprefix:5555       # subnetwork for which "to" route will be changed       /64
taddr=$tsubnet::1           # test address on $tsubnet

# setup for tests:
# dum0  dummy device
ip link del dev dum0 2>/dev/null || :
ip link add dum0 type dummy
ip link set up dev dum0

# clean route table for tprefix with only unreachable whole-network route
ip -6 route flush root $tprefix::/48
ip -6 route add unreachable $tprefix::/48
ip -6 route flush cache

ip -6 route add $tsubnet::/64 dev dum0

# put a lot of requests to rt/rtcache getting route to $taddr
trap 'kill $(jobs -p)' EXIT
rtgetter() {
    # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
    # get` first takes RTNL lock, and thus will be completely serialized with
    # e.g. route add and del.
    #
    # Ping, like other usual connect/tx activity, works without RTNL held.
    exec ping6 -n -f $taddr
}
rtgetter &

# do route del/add in a busyloop;
# after route add: check route get $taddr is not unreachable
while true; do
    ip -6 route del $tsubnet::/64 dev dum0
    ip -6 route add $tsubnet::/64 dev dum0
    r=`ip -6 -d -o route get $taddr`
    if echo "$r" | grep -q unreachable ; then
        echo
        echo
        echo BUG: `uname -a`
        echo BUG: Got unexpected unreachable route for $taddr:
        echo "$r"
        echo
        echo "route table for root $tprefix::/48"
        echo "---- 8< ----"
        ip -6 -d -o route show root $tprefix::/48
        echo "---- 8< ----"
        echo
        echo "route for $taddr (once again)"
        ip -6 -d -o -s -s -s route get $taddr
        exit 1
    fi
done

_______________________________________________
Babel-users mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/babel-users

