I did a ton of testing on my boat network (4 machines) and the lab (I've kind of lost track - today it was 7 machines, tomorrow 12)....
* quick summary: the nlogn branch has a bug in that redistribute local deny does not work. Haven't been able to find it. It also needs a corrected patch to message.c (attached). I think bird is misparsing something. ... but ... staging 32k routes each from two boxes, with the nlogn branch + my uthash stuff for resend.c, eventually (if you stage 2k injections from rtod over time) gets to carrying 64k routes. I rather like the uthash thing so far, much better than rbtrees.

* This is an i3 nuc running ubuntu 16....

d@dancer:~/git/rtod$ ip -6 route | grep unreach | wc -l
1
d@dancer:~/git/rtod$ ip -6 route | wc -l
65685

NetworkManager (ubuntu 16) goes nuts and stays nuts until these routes are finally retracted. (All the cpu fans in the room run high, which is helpful as winter approaches.) The daemon will get behind when churn happens (so it will do things like send a late hello and lose the default route to the main gw).

* An edgerouterx struggles. odhcpd and dnsmasq (also listening on the kernel netlink socket) eat 25% of cpu each while babel drops packets.

root@edgerouterx:~# ip -6 route | wc -l
16561

* My Arm dual core a15 did considerably better, but with odhcpd spinning away it stopped serving dhcp renews...

* Systemd (ubuntu 18) based boxes don't have any other daemon go nuts. (However, those are 12 core machines; I'll have to go try a weaker box.) Also, ubuntu 18 (systemd?) uses a default metric of 100 for its dhcp default route. Historically I've always had to modify /etc/dhcp/dhcpd.conf to not request a default route.... yea....

* Using:

  in ip fc00::/8 ge 8 deny
  in src-ip fc00::/8

on my core (mips) routers makes them ignore these rtod routes, and they motor happily along.

* When you get to having this many routes you hit other bugs.
root@ceres:~/git/rtod# ip -6 route | wc -l
73528
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
Flush terminated
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
root@ceres:~/git/rtod# ip -6 route | wc -l
39069
root@ceres:~/git/rtod# ip -6 route flush proto 50

* As for bird, which is running on a weak atom box... it peaked at about 32k routes. A limit? If I fire it up during all this carnage, weird things happen. Whether that's bird doing something weird, messing with metrics, or the cost of a route dump, or what, dunno... I had about 1000 seconds left before the lab network is usable again, which I'm using to write this email. :)

** gotta fix this:

Received prefix with no router id.
Couldn't parse packet (8, 12) from fe80::230:18ff:fec9:de9c on eno1.

** the carnage of retracting this many routes was "interesting"

** Probably most importantly... I get "permanent" weirdness (metrics? misparsing something?). For example, on my 172.22.0.0 network I end up with a route announced and stuck to the bird box here:

172.22.0.2 via 172.22.0.85 dev eno1 proto babel onlink

I should probably filter out any announcements of "in 172.22.0.0/24 ge 24" universally, but! .2 has redistribute local deny, is not running the nlogn branch with that bug in it, and this does not happen with babeld exclusively on the network. So I think this is a genuine bird bug.

* 20k routes over mcast over 2.4ghz wifi at 1mbit is *really* disabling of the link. With unicast... I barely noticed it.

* Anyway, in conclusion...

0) Both unicast and mcast in the babel rfc branch work. I have tons of packet captures; the only weird thing I saw with unicast was over openwrt wifi, where I periodically, and too often (IMHO), get an icmpv6 unreachable message when I should have seen an RA solicit happen somewhere around it.
1) I'd like it if babeld/bird made absolutely sure hellos went out on time, no matter how much compute was being used. (It's not just NetworkManager/odhcpd/dnsmasq, but anything else that eats cpu, like (for example) my nas server doing a backup over scp...) It might also be nice to try to *get* hellos faster. Is it possible to have both a mcast and a unicast socket on the babel port open at the same time?

2) Only a crazy person should try to make babeld carry more than 8k routes in its current incarnation on cheapo mips hardware. :) I can certainly see "staging" and "pacing" and stretching hellos and route announcement intervals under load as a stabler way to get to 64k+ routes, now that it appears we are no longer cpu bound within babel itself, on slightly higher end hardware, to get that far. For the record, the bgp route table is about 38k? routes nowadays.

* Next steps for me...

- go test the src-pref stuff
- gotta go climb a tree
- build some virtual topologies again
- it would be good to have a coherent test suite rather than me flailing, plus simple testing aimed at a usable, observable result (with sane numbers of routes)
- hmac? :puppy dog eyes:

--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
From 53e3882f8ef4d4d2939fb29226f6b6b4ecb071e6 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Thu, 8 Nov 2018 06:55:31 -0800
Subject: [PATCH 1/6] Re-re-re fix message.c ifup test

---
 message.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/message.c b/message.c
index 41e1031..043e4b6 100644
--- a/message.c
+++ b/message.c
@@ -1903,9 +1903,11 @@ send_request_resend(const unsigned char *prefix, unsigned char plen,
                                id, neigh->ifp, resend_delay);
     } else {
         struct interface *ifp;
-        FOR_ALL_INTERFACES(ifp)
+        FOR_ALL_INTERFACES(ifp) {
+            if(!if_up(ifp)) continue;
             send_multihop_request(&ifp->buf, prefix, plen, src_prefix, src_plen,
                                   seqno, id, 127);
+        }
     }
 }
 
--
2.7.4
From a7d3ace902677054b88c10b9fbd04d05b5edba42 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Wed, 7 Nov 2018 15:15:52 -0800
Subject: [PATCH 3/6] Increase maxmaxroutes to an unreasonable value

Now that we can carry more routes, try to carry them.
---
 xroute.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xroute.c b/xroute.c
index fbd5ad5..c54d015 100644
--- a/xroute.c
+++ b/xroute.c
@@ -341,7 +341,7 @@ check_xroutes(int send_updates)
     struct filter_result filter_result;
     int numroutes, numaddresses;
     static int maxroutes = 8;
-    const int maxmaxroutes = 16 * 1024;
+    const int maxmaxroutes = 256 * 1024;
 
     debugf("\nChecking kernel routes.\n");
 
--
2.7.4
From cc4d123f782c1714930a12999ae69a45be7d6260 Mon Sep 17 00:00:00 2001
From: Dave Taht <[email protected]>
Date: Wed, 7 Nov 2018 15:12:51 -0800
Subject: [PATCH 2/6] Log late hellos

A late hello is an early warning sign of a cpu overload or overbuffering.
---
 neighbour.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/neighbour.c b/neighbour.c
index 7575a72..8bf53a2 100644
--- a/neighbour.c
+++ b/neighbour.c
@@ -151,6 +151,11 @@ update_neighbour(struct neighbour *neigh, struct hello_history *hist,
         missed_hellos = 0;
         rc = 1;
     } else if(missed_hellos < 0) {
+        /* Late hello. Probably due to the link layer buffering
+           packets during a link outage or a cpu overload. */
+        fprintf(stderr,
+                "Late hello: bufferbloated neighbor %s\n",
+                format_address(neigh->address));
         hist->reach <<= -missed_hellos;
         missed_hellos = 0;
         rc = 1;
--
2.7.4
_______________________________________________
Babel-users mailing list
[email protected]
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users
