I did a ton of testing on my boat network (4 machines) and the lab (I've kind of lost track - today it was 7 machines, tomorrow 12)....
* quick summary: the nlogn branch has a bug in that redistribute local deny does not work. Haven't been able to find it. It also needs a corrected patch to message.c (attached). I think bird is misparsing something. ... but ... staging 32k routes each from two boxes, with the nlogn branch + my uthash stuff for resend.c, eventually (if you stage 2k injections from rtod over time) gets to carrying 64k routes. I rather like the uthash thing so far, much better than rbtrees.

* This is an i3 nuc running ubuntu 16....

d@dancer:~/git/rtod$ ip -6 route | grep unreach | wc -l
1
d@dancer:~/git/rtod$ ip -6 route | wc -l
65685

NetworkManager (ubuntu 16) goes nuts and stays nuts until these routes are finally retracted. (All the cpu fans in the room run high, which is helpful as winter approaches.) The daemon will get behind when churn happens (so it will do things like send a late hello and lose the default route to the main gw).

* An edgerouterx struggles. odhcpd and dnsmasq (also listening on the kernel netlink socket) eat 25% of cpu each while babel drops packets.

root@edgerouterx:~# ip -6 route | wc -l
16561

* My Arm dual core a15 did considerably better, but with odhcpd spinning away it stopped serving dhcp renews...

* Systemd (ubuntu 18) based boxes don't have any other daemon go nuts. (However, those are 12 core machines; I'll have to go try a weaker box.) Also, ubuntu 18 (systemd?) uses a default metric of 100 for its dhcp default route. Historically I've always had to modify /etc/dhcp/dhcpd.conf to not request a default route.... yea....

* Using:

  in ip fc00::/8 ge 8 deny
  in src-ip fc00::/8

on my core (mips) routers makes them ignore these rtod routes, and they motor happily along.

* When you get to having this many routes you hit other bugs.
root@ceres:~/git/rtod# ip -6 route | wc -l
73528
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
Flush terminated
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
root@ceres:~/git/rtod# ip -6 route | wc -l
39069
root@ceres:~/git/rtod# ip -6 route flush proto 50

* As for bird, which is running on a weak atom box... it peaked at about 32k routes. A limit? If I fire it up during all this carnage, weird things happen. Whether that's bird doing something weird, messing with metrics, or the cost of a route dump, or what, dunno... I had about 1000 seconds left before the lab network is usable again, which I'm using to write this email. :)

** gotta fix this:

Received prefix with no router id.
Couldn't parse packet (8, 12) from fe80::230:18ff:fec9:de9c on eno1.

** the carnage of retracting this many routes was "interesting"

** Probably most importantly... I get "permanent" weirdness (metrics? misparsing something?). For example, on my 172.22.0.0 network I end up with a route announced and stuck to the bird box here:

172.22.0.2 via 172.22.0.85 dev eno1 proto babel onlink

I should probably filter out any announcements of "in 172.22.0.0/24 ge 24" universally, but! .2 has redistribute local deny, is not running the nlogn branch with that bug in it, and this does not happen with babeld exclusively on the network. So I think this is a genuine bird bug.

* 20k routes over mcast over 2.4ghz wifi at 1mbit is *really* disabling of the link. With unicast... I barely noticed it.

* Anyway, in conclusion...

0) Both unicast and mcast in the babel rfc branch work. I have tons of packet captures; the only weird thing I saw with unicast was over openwrt wifi, where I periodically, and too often (IMHO), get an icmpv6 unreachable message when I should have seen an RA solicit happen somewhere around it.
1) I'd like it if babeld/bird made absolutely sure hellos went out on time, no matter how much compute was being used. (It's not just NetworkManager/odhcpd/dnsmasq, but anything else that eats cpu, like (for example) my nas server doing a backup over scp...) It might also be nice to try to *get* hellos faster. Is it possible to have both a mcast and a unicast socket on the babel port open at the same time?

2) Only a crazy person should try to make babeld carry more than 8k routes in its current incarnation on cheapo mips hardware. :) I can certainly see "staging" and "pacing" and stretching hellos and route announcement intervals under load as a stabler way to get to 64k+ routes, now that it appears we are no longer cpu bound within babel itself, on slightly higher end hardware, to get that far. For the record, the bgp route table is about 38k? routes nowadays.

* Next steps for me...

- go test the src-pref stuff
- gotta go climb a tree
- build some virtual topologies again
- it would be good to have a coherent test suite rather than me flailing, plus simple testing aimed at a usable, observable result (with sane numbers of routes)
- hmac? :puppy dog eyes:

--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
From 53e3882f8ef4d4d2939fb29226f6b6b4ecb071e6 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Thu, 8 Nov 2018 06:55:31 -0800
Subject: [PATCH 1/6] Re-re-re fix message.c ifup test

---
 message.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/message.c b/message.c
index 41e1031..043e4b6 100644
--- a/message.c
+++ b/message.c
@@ -1903,9 +1903,11 @@ send_request_resend(const unsigned char *prefix, unsigned char plen,
                                id, neigh->ifp, resend_delay);
     } else {
         struct interface *ifp;
-        FOR_ALL_INTERFACES(ifp)
+        FOR_ALL_INTERFACES(ifp) {
+            if(!if_up(ifp)) continue;
             send_multihop_request(&ifp->buf, prefix, plen, src_prefix, src_plen,
                                   seqno, id, 127);
+        }
     }
 }
 
--
2.7.4
From a7d3ace902677054b88c10b9fbd04d05b5edba42 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Wed, 7 Nov 2018 15:15:52 -0800
Subject: [PATCH 3/6] Increase maxmaxroutes to an unreasonable value

Now that we can carry more routes, try to carry them.
---
 xroute.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xroute.c b/xroute.c
index fbd5ad5..c54d015 100644
--- a/xroute.c
+++ b/xroute.c
@@ -341,7 +341,7 @@ check_xroutes(int send_updates)
     struct filter_result filter_result;
     int numroutes, numaddresses;
     static int maxroutes = 8;
-    const int maxmaxroutes = 16 * 1024;
+    const int maxmaxroutes = 256 * 1024;
 
     debugf("\nChecking kernel routes.\n");
 
--
2.7.4
From cc4d123f782c1714930a12999ae69a45be7d6260 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Wed, 7 Nov 2018 15:12:51 -0800
Subject: [PATCH 2/6] Log late hellos

A late hello is an early warning sign of a cpu overload or overbuffering.
---
 neighbour.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/neighbour.c b/neighbour.c
index 7575a72..8bf53a2 100644
--- a/neighbour.c
+++ b/neighbour.c
@@ -151,6 +151,11 @@ update_neighbour(struct neighbour *neigh, struct hello_history *hist,
         missed_hellos = 0;
         rc = 1;
     } else if(missed_hellos < 0) {
+        /* Late hello. Probably due to the link layer buffering
+           packets during a link outage or a cpu overload. */
+        fprintf(stderr,
+                "Late hello: bufferbloated neighbor %s\n",
+                format_address(neigh->address));
         hist->reach <<= -missed_hellos;
         missed_hellos = 0;
         rc = 1;
--
2.7.4
_______________________________________________
Babel-users mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users
