> https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture
> (also attached to this email)
>
> which reproduces the problem in several minutes just on one computer and
> retested it locally: I can reliably reproduce the issue on pristine
> Debian 3.16.7-ckt25-2 (on both Atom and Core2 notebooks) and on pristine
> 3.16.35 on Atom (compiled by me, since Debian kernel team has not yet
> uploaded 3.16.35 to Jessie).

I have been running this script on four different machines for hours now
without reproducing your bug on 4.4 or later kernels.

It does trigger on a 3.14 kernel. (It helps to do a killall fping6 before
exiting!)

At one level, I'm relieved: one less babel bug to worry about in openwrt
(now 4.4 based), although one of the platforms I work on is still stuck at
3.18, as is the c2, at 3.14 (for now).

At another level I still really, really, really wanted atomic updates in
babel, and was clearing the decks to make a run at the right netlink stuff
when I decided to check whether your bug existed in my kernels. :(

Weirdly demotivating.

d@dancer:~/bin$ ssh root@pi3 uname -a
Linux pi3 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
d@dancer:~/bin$ ssh root@pi2 uname -a
Linux pi2 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
d@dancer:~/bin$ uname -a
Linux dancer 4.5.0-rc7-fqfi #1 SMP PREEMPT Mon Mar 7 16:04:17 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
...

The odroid C2 has the bug.

d@dancer:~/bin$ ssh root@c2 uname -a
Linux c2 3.14.29-56 #1 SMP PREEMPT Wed Apr 20 12:15:54 BRT 2016 aarch64 aarch64 aarch64 GNU/Linux

BUG: Got unexpected unreachable route for 2226:3333:4444:5555::1: # I'd changed the number
unreachable 2226:3333:4444:5555::1 from :: dev lo src fd99::2 metric 0 \ cache error -101

route table for root 2226:3333:4444::/48
---- 8< ----
unicast 2226:3333:4444:5555::/64 dev dum0 proto boot scope global metric 1024
unreachable 2226:3333:4444::/48 dev lo proto boot scope global metric 1024 error -101
---- 8< ----

route for 2226:3333:4444:5555::1 (once again)
unreachable 2226:3333:4444:5555::1 from :: dev lo src fd99::2 metric 0 \ cache error -101 users 1 used 3

>
> It is always the same: the issue reproduces reliably in several minutes.
> And it looks like e.g.
>
> ----- 8< ----
> root@mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture
> PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
> E.E.E.....E......E..E............E...E..
> <more output from ping>
>
> BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
> BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
> unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \ cache error -101
>
> route table for root 2222:3333:4444::/48
> ---- 8< ----
> unicast 2222:3333:4444:5555::/64 dev dum0 proto boot scope global metric 1024
> unreachable 2222:3333:4444::/48 dev lo proto boot scope global metric 1024 error -101
> ---- 8< ----
>
> route for 2222:3333:4444:5555::1 (once again)
> unreachable 2222:3333:4444:5555::1 from :: dev lo src 2001:67c:1254:20::1 metric 0 \ cache error -101 users 1 used 4
>
> real 0m49.938s
> user 0m4.488s
> sys 0m5.872s
> ---- 8< ----
>
> The issue should not show itself with kernels >= 4.2, because there the
> lookup procedure does not take table lock twice, and /128 cache entries
> are not routinely created (they are created only upon PMTU exception).
>
> I'm running Debian testing on my development machine. Currently it has
> 4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are
> not created there just because a route was looked up.
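
Going by the >= 4.2 cutoff quoted above, a box can be sorted into "expected
to be affected" or "expected to be fine" from its `uname -r` string before
spending hours on the torture script. A minimal sketch in plain sh follows.
The helper name `kernel_at_least` is made up for illustration, and the
cutoff itself is only as good as the quoted analysis: a distro kernel with
backported routing changes could behave differently from what its version
number suggests.

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR
# Hypothetical helper: succeeds if a `uname -r` style string such as
# "3.16.35-mini64" or "4.5.0-rc7-fqfi" is at least MAJOR.MINOR.
kernel_at_least() {
    rel=$1 want_major=$2 want_minor=$3
    major=${rel%%.*}          # text before the first dot
    rest=${rel#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of what remains
    [ "$major" -gt "$want_major" ] && return 0
    [ "$major" -eq "$want_major" ] && [ "${minor:-0}" -ge "$want_minor" ]
}

# The 4.2 cutoff is taken from the quoted explanation, not verified here.
if kernel_at_least "$(uname -r)" 4 2; then
    echo "kernel >= 4.2: race not expected per the analysis above"
else
    echo "kernel < 4.2: worth running rtcache-torture"
fi
```

With the release strings from this thread, 3.14.29-56 and 3.16.35-mini64
fall on the "worth running" side, while 4.4.12-v7+ and 4.5.0-rc7-fqfi do
not, which matches what was observed.
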
>
> Kirill
>
>
> ---- 8< ---- (rtcache-torture)
> #!/bin/sh -e
> # torture for IPv6 RT cache, trying to hit the race between lookup,cache-add & route add
> # http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html
>
>
> tprefix=2222:3333:4444  # "whole-network" prefix for tests /48
> tsubnet=$tprefix:5555   # subnetwork for which "to" route will be changed /64
> taddr=$tsubnet::1       # test address on $tsubnet
>
> # setup for tests:
>
> # dum0 dummy device
> ip link del dev dum0 2>/dev/null || :
> ip link add dum0 type dummy
> ip link set up dev dum0
>
> # clean route table for tprefix with only unreachable whole-network route
> ip -6 route flush root $tprefix::/48
> ip -6 route add unreachable $tprefix::/48
> ip -6 route flush cache
>
> ip -6 route add $tsubnet::/64 dev dum0
>
>
> # put a lot of requests to rt/rtcache getting route to $taddr
> trap 'kill $(jobs -p)' EXIT
> rtgetter() {
>     # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
>     # get` first takes RTNL lock, and thus will be completely serialized with
>     # e.g. route add and del.
>     #
>     # Ping, like other usually connect/tx activity works without RTNL held.
>     exec ping6 -n -f $taddr
> }
> rtgetter &
>
> # do route del/route in busyloop;
> # after route add: check route get $addr is not unreachable
> while true; do
>     ip -6 route del $tsubnet::/64 dev dum0
>     ip -6 route add $tsubnet::/64 dev dum0
>     r=`ip -6 -d -o route get $taddr`
>     if echo "$r" | grep -q unreachable ; then
>         echo
>         echo
>         echo BUG: `uname -a`
>         echo BUG: Got unexpected unreachable route for $taddr:
>         echo "$r"
>         echo
>         echo "route table for root $tprefix::/48"
>         echo "---- 8< ----"
>         ip -6 -d -o route show root $tprefix::/48
>         echo "---- 8< ----"
>         echo
>         echo "route for $taddr (once again)"
>         ip -6 -d -o -s -s -s route get $taddr
>         exit 1
>     fi
> done

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

