So, after getting 2 days per year for 6 years to work on that, I finally got 4.

https://github.com/dtaht/babeld/commits/atomic

I (temporarily) ripped out the netlink code and replaced it with system("ip route whatever") to finally get the state machine right without having to fiddle with netlink directly.

* can't change the kernel metric

* ip route replace rules are weird in linux

  reach -> unreach: requires a del/add
  unreach -> reach: can use a replace
  reach -> reach elsewhere: replace (have not tested interface flipping yet)
  unreach -> unreach: replace (shouldn't happen)

* the semantics of add_route were weird; I changed it to use the position of the new parameters, rather than gate, so as to unconfuse myself. There was at least one other bug.

So, this code was originally done the prior way to avoid conflicting with "stuck multicast routes", and I don't have (and have never had) a test case that showed that. ?

My topology today was:

  comcast gw ------------------------ couch gw
       |             |            |
  spaceheater      ceres        dancer

couchgw filters all this stuff out (mips) but still chokes somewhere
spaceheater and ceres are two 12-core boxes running babeld and rtod
dancer is the box with the new "atomic" code
comcast gw is an ARM A15 box (stock 1.8.3 right now)

* In a prior attempt 2 years back, I did discover that attempts to insert/retrieve lots of routes could return EAGAIN, ENOSPC, and a few other things that are maybe not checked for in the current code. Haven't gone back there. But while doing myself in I did get a couple of these on dancer:

  ip route add unreachable 172.22.0.172/32 from 0.0.0.0/0 table 254 metric 0 proto 42
  netlink_read: recvmsg(): No buffer space available

* Another "interesting" thing I notice with the new code: when I inject 1024 identical routes on ceres and spaceheater via "rtod -r 1024 -H test" at roughly the same time, elsewhere on the net I end up with something similar to ECMP.

  root@dancer:~/git/babeld-atomic# ip -6 route | grep fe80::225:90ff:fec1:6252 | wc -l
  453
  root@dancer:~/git/babeld-atomic# ip -6 route | grep fe80::225:90ff:fec2:2aa3 | wc -l
  575

Which is cool...
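
(As an aside, the replace rules listed earlier can be written down as a small decision function. This is only a sketch in Python, not the code on the atomic branch: Route, render and transition are hypothetical names, and the command strings simply imitate the log lines quoted in this mail (table 254, metric 0, proto 42). The exact form of the replace command is an assumption and has not been checked against iproute2.)

```python
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    """Hypothetical route record; gateway=None means an unreachable route."""
    prefix: str
    src_prefix: str
    ifname: Optional[str] = None
    gateway: Optional[str] = None

    @property
    def reachable(self) -> bool:
        return self.gateway is not None


def render(verb: str, r: Route) -> str:
    """Format one command in the style of the log lines in this mail."""
    if r.reachable:
        return (f"ip route {verb} {r.prefix} from {r.src_prefix} table 254 "
                f"metric 0 dev {r.ifname} via {r.gateway} proto 42")
    return (f"ip route {verb} unreachable {r.prefix} from {r.src_prefix} "
            f"table 254 metric 0 proto 42")


def transition(old: Route, new: Route) -> List[str]:
    """Return the commands needed to move from the old route to the new one."""
    if old.reachable and not new.reachable:
        # reach -> unreach: replace does not work here, so del then add.
        return [render("del", old), render("add", new)]
    # unreach -> reach, reach -> reach elsewhere, and the (unexpected)
    # unreach -> unreach case are all handled with a single replace.
    return [render("replace", new)]
```

Calling transition() with a reachable old route and an unreachable new one yields a del followed by an add unreachable; every other combination yields a single replace.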

But (?) in the process it also adds the routes, then later on replaces some of them with ones pointing elsewhere, when I figured it would keep the first one it got, as the basic metrics are equal.

Another oddity is that it will batch up dels before the unreachables. This sort of behavior strikes me as having existed before but been impossible to see, and it is perhaps the cause of some issues. Or I missed a state in the state machine, but there's no way (at this low level) that can happen, I think.

  ...
  ip route del fcd8:8fca:2dc0:3ff::/64 from ::/0 table 254 metric 0 dev eno1 via fe80::225:90ff:fec1:6252 proto 42
  ip route del fcd8:8fca:2dc0::/48 from ::/0 table 254 metric 0 dev eno1 via fe80::225:90ff:fec1:6252 proto 42
  ip route add unreachable fcd8:8fca:2dc0:11::/64 from ::/0 table 254 metric 0 proto 42
  ip route add unreachable fcd8:8fca:2dc0:12::/64 from ::/0 table 254 metric 0 proto 42

* injecting and removing this many routes causes a burp in ARM- and MIPS-based daemons... and they go unreachable.

(note that I'm doing the injection on a pair of 12-core machines that do end up spinning a core in the xroute add code, but what I'm trying to exercise is the route code on all the other boxes)

ANYWAY. The most productive 4 days I've had on babel in years.

I was not running bird during this exercise; will try that.

I do wish I could get more folk blowing things up with rtod. I need to add ipv4 injection tests to it next.

-- 
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
